PREFACE


Yiqi Luo, School of Integrative Plant Science, Cornell University, Ithaca, USA 


Benjamin Smith, Hawkesbury Institute for the Environment, 
Western Sydney University, Richmond, Australia 


The ecosystems of the vegetated land surface are 
critical to the health of the planet, its biodiversity, 
and people, providing benefits, so-called ecosys- 
tem services, on which human health, well-being 
and economic activity rely. The carbon cycle con- 
nects ecosystems on land to the atmosphere and 
the climate system. Rising temperatures, changing 
rainfall distributions, and more frequent extreme 
weather impact ecosystem services, often in a neg- 
ative way. Rising atmospheric carbon dioxide con- 
centrations — the main proximal cause of climate 
change — also affect ecosystems directly, enhanc- 
ing photosynthesis and plant water-use efficiency 
in experimental settings, and almost certainly in 
nature. Due to a small imbalance between global 
photosynthesis, absorbing carbon dioxide from 
the atmosphere, and the return flux (mainly 
decomposition of litter and soil carbon) from the 
land to the atmosphere, a sizeable proportion of 
anthropogenic CO2 emissions are reabsorbed by
ecosystems — the land carbon sink. A key goal of 
international climate policy, to supplement fos- 
sil fuel reductions with negative emissions over a 
transitional period, largely relies on management 
interventions to preserve and enhance the land 
carbon sink. The land carbon cycle, then, is cen- 
tral both to mitigation (emissions reduction) and 
adaptation (management of climate impacts and 
risks) responses and policies. Developing effective 
measures requires predictions of how the system 
may be expected to respond under various scenar- 
ios. For this, of course, we need models. 


While relevant to supporting climate science 
and policy, modeling is also gaining “market 
share” in environmental and ecological research 
generally. Several forces are involved in this trend. 
Ever increasing volumes of readily available data 
from observational and experimental networks are 
making it easier to parameterize and robustly vali- 
date models. Cheaper, more powerful computers, 
including cloud computing services, bring com- 
plex numerical algorithms and data assimilation 
methods within reach for many applications. New 
statistical and optimization methodologies have 
been developed and made accessible through con- 
venient packages and toolboxes, useful not only 
to modelers but as platforms for collaboration 
between modelers and empirical scientists. 

This book provides an overview of the current 
state of the land carbon cycle modeling field, exem- 
plifying recent developments as described above. 
The book is built upon a summer training course, 
New Advances in Land Carbon Cycle Modeling, held annu- 
ally since 2018 at Northern Arizona University. 
Over the four years the course has been offered, 
attendees from 32 countries in six continents have 
undertaken the training. The first training course 
in 2018 attracted about 40 participants, with a 
similar number in 2019. Due to the pandemic of 
COVID-19, the in-person course was replaced by 
an online version in 2020. Originally, we planned 
to have 25 attendees, but ended up with nearly 85 
participants from six continents. This grew further 
to 150 virtual attendees in 2021. 


This book is mainly based on 31 lectures 
(including pre-training lectures) and ten practices 
prepared for the New Advances in Land Carbon Cycle 
Modeling training course in 2020 and 2021. 

The book offers cutting-edge knowledge and 
techniques on carbon cycle science and modeling. 
We have designed ten training units in such a way 
that everyone can gain regardless of their prior 
background in modeling. The chapters range from 
theoretical foundation of land carbon cycling, 
traceability, and data assimilation to machine 
learning and ecological forecasting. Overall, four 
techniques are covered: the matrix approach to 
land carbon modeling; data assimilation for data- 
driven modeling; ecological forecasting; and com- 
bined machine learning with data assimilation to 
improve model prediction. 

The organization of the book aligns with the 
training course, which has two blocks. The first 
block in units 1-5 is about the matrix approach 
to land carbon cycle modeling. The second block 
in units 6-10 is on data assimilation, ecological
forecasting, and machine learning. 

The matrix approach introduced in units 1-5 
first describes a matrix equation. It is demonstrated 
that the matrix equation can unify land carbon 
cycle models; offer a new theoretical framework to 
guide carbon cycle research; help accelerate com- 
putational efficiency for spin-up; offer new ana- 
lytics to diagnose model performance; and allow 
data assimilation of complex models. Five skills are 
covered in units 1-5, namely: drawing the carbon 
flow diagram and writing carbon balance equa- 
tions of a model; developing matrix models from 
carbon balance equations and coding the matrix 
model; adding diagnostic variables to matrix mod- 
els; adding semi-analytic spin-up (SASU) algo- 
rithms; and traceability analysis. 

The matrix equation can be used to derive three 
diagnostic variables: carbon input, residence time, 
and carbon storage potential. The matrix equa- 
tion can also be used to get an analytic solution of 
steady-state pool sizes, leading to SASU. The matrix 
equation is the foundation for traceability analysis. 
The chapters of units 1-5 explain these new skills. 

Units 6-10 cover data assimilation, ecological 
forecasting, and machine learning. To realistically 
forecast ecological responses to climate change, 
three elements all need to be perfectly aligned: 
model structure, model parameterization, and the 
external forcing variables. The matrix approach 
discussed in units 1-5 is about process-based
model structure. Data assimilation and machine
learning in units 6-8 and 10 will help improve 
model parameterization. EcoPAD, a workflow sys- 
tem that is described in unit 9, will link real-time 
forcing to model forecasting. 

Chapters in unit 6 describe the seven-step 
procedure of data assimilation. The seven steps 
are: defining a research objective; acquiring data 
sets; using one model; developing a cost func- 
tion; minimizing mismatches between modeled 
and observed values with a global optimization 
method; estimating parameters; and predicting 
ecosystem responses. Chapters in units 7 and 8 
are about applications of data assimilation to the 
SPRUCE field experiment and satellite observa- 
tions, and evaluation of values of different data sets 
to constrain models and their predictions. 

Chapters in unit 9 describe the Ecological 
Platform for Assimilating Data (EcoPAD) frame- 
work, which automatically ingests data into a 
model through a data assimilation system for
ecological forecasting.

Chapters in unit 10 introduce machine learn- 
ing, a PROcess-guided deep learning combined 
with DAta-driven modeling (PRODA) approach, 
and its application to optimize parameterization of 
the CLM5 land surface model. 

Most of the chapters are written in such a way as 
to be understandable by readers with minimal mod- 
eling background. Practices are targeted at a suitable 
level for such readers. A few chapters may require 
some mathematical background to be fully under- 
stood. The book offers three appendix chapters on, 
respectively: basic linear algebra; introductory pro- 
gramming with Python; and the Carbon Training 
(CarboTrain) package we use as a toolbox for the 
training course and the practice chapter in each unit. 
Depending on their prior knowledge, readers may 
choose to read these appendix chapters as optional 
supporting material to the chapters in units 1-10. 

All the chapters are accompanied by pre-recorded
lectures or practice instructions. These pre-recorded
videos are available at
https://www2.nau.edu/luo-lab/download/4th_training_course.php.
(Please search for the videos using “ecolab
Yiqi Luo” if the website is moved away from NAU.) 
If you plan to master skills described in the chap- 
ters, you may find it useful to read the book chap- 
ter; listen to the corresponding video; take the quiz 
at the end of each chapter after watching the video; 
and attempt the practice chapter at the end of each 
unit, following the pre-recorded instruction. 




Please be aware that this book does not teach 
programming or how to code a model, nor does it 
teach how to do model development or modifica- 
tion. Readers with limited programming experi- 
ence may, however, find the brief introduction to 
Python coding in appendix 2 useful. 

The open access electronic version of this book 
has been made available thanks to financial
contributions by Northern Arizona University, Lund
University, and Oak Ridge National Laboratory.
Finally, we wish to thank all 22 authors who have 
worked very hard for months to prepare the mate- 
rial for this book. We hope that you, the reader, 
find it useful and rewarding. 


YIQI LUO, 
BEN SMITH, 
September, 2021. 

The land carbon cycle has been extensively studied
yet its fundamental properties have not been fully 
understood. This chapter offers empirical evidence 
to demonstrate a general dynamic pattern that the 
land carbon cycle changes in a direction toward 
a moving attractor in response to global change. 
This general pattern is captured by a matrix equation.
The relatively simple matrix equation can
unify land carbon cycle models, accelerate
computational efficiency for spin-up, diagnose model
performance with new analytics, enable data
assimilation with complex models to improve
their predictive skills, and guide carbon cycle
research with a new theoretical framework of
dynamic disequilibrium.


CONVERGENCE OF THE LAND CARBON CYCLE 


In the late 2010s, a deserted village, Houtouwan, 
on an island off the east coast of China was 
discovered to be completely overrun by veg- 
etation approximately 20 years after dwellers left 
(Smithsonian Channel, The Abandoned Chinese 
Village that Nature Reclaimed) (Figure 1.1). What 
might happen to the place in another 20 years or 
even longer? It is likely that trees will gradually
take over to form a coastal forest. 

In fact, Mother Nature would overrun all urban 
places in the world if human disturbances were 
removed. Imagine that we could magically remove 
humans from any highly commercialized, heavily 
human-disturbed urban areas, such as Manhattan 
of New York City or Lujiazui of Shanghai, for 300 
years: the place would soon be overrun by plants, 
animals and microbes. Without any human activi- 
ties, small trees would grow from cracks in con- 
crete in five years, most of the high-rise buildings 
would collapse and forests would probably take 
over in 50 years. In 300 years, Manhattan would 
most likely be occupied by a deciduous forest sim- 
ilar to those in northeastern USA, and Lujiazui of 
Shanghai by some lowland forests. 

Similarly, it has been repeatedly observed how
vegetation takes over landscapes after natural
disturbances occur. For example, the 1988
Yellowstone fires burned about 1.2 million acres in
and around Yellowstone National Park, and their size
and severity led to a proclamation that Yellowstone
had been destroyed. Yet the burned landscape had been
retaken by thriving young lodgepole pine trees
30 years after the fires (Turner 2018).
Secondary succession is an ecological term for this
entirely natural process. Ecosystem succession has
been extensively studied, mainly from the perspectives
of species dynamics and community structures.

Figure 1.1. Overrunning of the deserted village of Houtouwan by vegetation approximately 20 years after dwellers left the
island, which is situated off the east coast of China. Vegetation taking over after anthropogenic and natural disturbances are
removed is caused by the internal processes of the carbon cycle that drive an ecosystem toward an attractor.

From a carbon cycle perspective, ecosystems 
converge toward some attractor states after anthro- 
pogenic and natural disturbances are removed. 
(Note that the states that ecosystems converge 
toward are moving attractors under global change. 
This point will be discussed later.) This convergence
is done by “Mother Nature” and actually
results from the internal processes of the land car- 
bon cycle. What, then, are those internal processes? 
And, how can we mathematically represent them? 


DONOR POOL-DOMINANT TRANSFER AND 
OTHER PROPERTIES THAT GOVERN THE LAND 
CARBON CYCLE 


Before we answer those questions, let us briefly 
review the land carbon cycle. Carbon enters an 
ecosystem via photosynthesis. Photosynthetic 
products are partly allocated for autotrophic res- 
piration and partly for growth of leaf, stem, and 
root. Once a plant or its parts die, they become 
litter and enter the litter pools. Litter decomposes, 
partly released to the atmosphere via heterotrophic
respiration and partly incorporated into soil to
become soil organic matter. Soil organic matter 
goes through decomposition and stabilization over 
and over again, driving soil carbon cycling. We will 
examine some of the processes to see what the best 
equation would be to represent them. 

Let us first look at litterfall in which dead leaves 
fall from the tree canopy to the ground. To make it 
a subject of study, we define two new terms: donor 
pool and recipient pool. The donor pool donates 
litter whereas the recipient pool receives litter 
(Figure 1.2a). Litterfall is a rate process that moves 
carbon from the donor pool to the recipient pool. 
In this case, the rate of litter falling is proportional 
to the amount of litter in the donor pool while the 
amount of litter in the recipient pool has nothing 
to do with the rate of litterfall at all. Thus, the rate 
of litterfall is controlled by the donor pool. 

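The rate of litterfall, denoted by dX(t)/dt, equals
the donor pool size, X(t), times a coefficient (k),
with a negative sign because litterfall removes
carbon from the donor pool:

$$\frac{dX(t)}{dt} = -k\,X(t) \tag{1.1}$$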


This equation describes the donor pool-dominated
carbon transfer (Figure 1.2a). Litterfall in the real
world is also affected by wind and other environmental
conditions over seasons. Modelers usually
use an environmental scalar, ξ(t), to account for
the effects of phenology, wind and other environmental
factors on litterfall as:

$$\frac{dX(t)}{dt} = -k\,\xi(t)\,X(t) \tag{1.2}$$

Figure 1.2. Macroscopic patterns of carbon transfer processes and distributions. The carbon transfer processes include (a)
litterfall; (b) decomposition of ragweed (Ambrosia psilostachya) litter (Cheng et al. 2010); (c) soil organic carbon decomposition
from incubation at 35 °C (replotting the data from Haddix et al. 2011); (d) rates of absolute carbon (C) change during
the forest succession (Yang et al. 2011); (e) changes in soil organic matter after agricultural cultivation in Alberta, Canada
(extracted data from Dormaar 1979); and (f) a vertical distribution of soil carbon with depth in Malawi, Africa, from a soil
carbon database. The macroscopic patterns of almost all carbon transfer processes and distributions can be described by the
first-order decay function.


Once litter has fallen on the ground, it decom- 
poses. Litter decomposition is usually studied 
with litterbags or wood logs on ground or in air. 
Researchers initiate a study with a certain amount of 
litter in litterbags, place them in the field, then collect 
a subset of litterbags once every few weeks, find the 
dry weight, and calculate the dry weight remaining 
in comparison with the initial amount. A typical data 
set of litter decomposition shows that mass remain- 
ing becomes less and less as sampling time goes 
on (Figure 1.2b). This type of data set can usually 
be fitted by a first-order decay curve as in Equation 
1.1. Thus, litter decomposition is often described by
the same donor pool-dominated carbon transfer equation. In 
this case, k is often called the litter decay constant. 
Actually, k varies with litter type and location. Zhang
et al. (2008) synthesized nearly 300 datasets from
70 studies all over the world. The study found that 
Equation 1.1 fits all data sets very well although the 
value of k greatly varies with litter types and envi- 
ronment. Cai et al. (2018) synthesized more than 
1,600 data sets of straw decomposition for six types 
of crops. Their study fits a three-exponent equation 
to all the data sets. Thus, straw decomposition also 
follows the donor pool-dominated carbon transfer. 
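
As a rough illustration of how a litter decay constant can be estimated, the sketch below fits the first-order decay curve X(t) = X0 exp(-kt), the solution of Equation 1.1, to litterbag-style data by linear regression on the logarithm of mass remaining. The data are synthetic and the function name `fit_decay_constant` is hypothetical; real syntheses such as those cited above may well use other fitting procedures.

```python
import numpy as np

def fit_decay_constant(t, mass_fraction):
    """Estimate k and X0 in X(t) = X0 * exp(-k * t).

    Taking logs gives ln X = ln X0 - k * t, a straight line in t,
    so an ordinary least-squares line fit recovers both parameters.
    """
    slope, intercept = np.polyfit(t, np.log(mass_fraction), 1)
    return -slope, np.exp(intercept)

# Synthetic litterbag data (hypothetical, not taken from any cited study):
# sampling times in years and fraction of initial dry mass remaining.
t = np.array([0.0, 0.25, 0.5, 1.0, 1.5, 2.0])
true_k = 0.8                      # assumed decay constant (1/yr)
mass = np.exp(-true_k * t)        # noise-free first-order decay

k_hat, x0_hat = fit_decay_constant(t, mass)
print(f"estimated k = {k_hat:.3f} per year, X0 = {x0_hat:.3f}")
```

With noisy field data the log transform changes the error weighting, so a nonlinear least-squares fit on the untransformed masses is a common alternative.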

Another important process of the carbon cycle 
is soil organic carbon (SOC) decomposition. SOC 
decomposition is usually studied by soil incuba- 
tion. That is, researchers collect soil samples from 
the field and put the samples in jars for a period of 
time. They then collect gas samples once in a while 
to measure the amount of carbon released from 
the soil sample through microbial respiration. Data 
are usually plotted either by measured CO2 release
or cumulative CO2 released on the y-axis with time
on the x-axis (Figure 1.2c). Almost all data fol- 
low a similar pattern. This pattern also can be well 
described by donor pool-dominated transfer. 
Schädel et al. (2013) fitted data of SOC decomposition
with one-, two-, and three-pool models.
They found that two- or three-pool models work
well for SOC decomposition. Schädel et al. (2014)
synthesized more than 120 data sets from perma- 
frost regions. Xu et al. (2016) synthesized nearly 
400 data sets from different locations all over the 
world. Both of the latter studies found that all their 
data sets (more than 500) can be well fitted by 
two or three-pool models. This suggests that the 
donor pool-dominated carbon transfer equation 
also works for SOC decomposition. 
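
To make the idea of a multi-pool incubation model concrete, here is a minimal sketch of a two-pool version in which each pool decays independently by first-order kinetics. The parallel-pool form, the parameter names (`f_fast`, `k_fast`, `k_slow`) and the numerical values are assumptions chosen for illustration; they are not the exact formulation or estimates of the cited studies.

```python
import numpy as np

def cumulative_co2(t, c0, f_fast, k_fast, k_slow):
    """Cumulative carbon released by time t from two independent pools.

    A fraction f_fast of the initial carbon c0 decays at rate k_fast,
    the remainder at rate k_slow (both first-order, donor controlled).
    """
    fast = f_fast * (1.0 - np.exp(-k_fast * t))
    slow = (1.0 - f_fast) * (1.0 - np.exp(-k_slow * t))
    return c0 * (fast + slow)

def respiration_rate(t, c0, f_fast, k_fast, k_slow):
    """Instantaneous release rate, the time derivative of cumulative_co2."""
    return c0 * (f_fast * k_fast * np.exp(-k_fast * t)
                 + (1.0 - f_fast) * k_slow * np.exp(-k_slow * t))

# Hypothetical one-year incubation, time in days.
t = np.linspace(0.0, 365.0, 366)
params = dict(c0=100.0, f_fast=0.05, k_fast=0.1, k_slow=0.0005)
cum = cumulative_co2(t, **params)
rate = respiration_rate(t, **params)
print(f"carbon released after one year: {cum[-1]:.2f} of {params['c0']:.0f}")
```

The release rate is high at first, when the fast pool dominates, and then settles to a slow, nearly constant decline, which is the general shape of incubation curves like the one in Figure 1.2c.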

Yang, Luo, and Finzi (2011) synthesized more 
than 124 studies of soil carbon dynamics at differ- 
ent stages of forest succession. During the course 
of secondary succession, some forests gain carbon 
and some lose carbon (Figure 1.2d). In either 
case, carbon dynamics still can be described by the 
donor pool-dominated carbon transfer. Moreover, the soil
organic matter remaining after long-term cultivation
(Figure 1.2e) and the vertical distribution of
soil organic carbon with depth (Figure 1.2f) both
follow monotonic patterns, consistent with the
donor pool-dominated carbon transfer.

So far, we have examined macroscopic patterns 
of key carbon transfer processes (e.g., litterfall, lit- 
ter decomposition, and SOC decomposition) and 
distributions. These macroscopic patterns are very 
typical as almost ubiquitously revealed by thou- 
sands of field and laboratory studies. The macro- 
scopic patterns can be well described by the donor 
pool-dominated carbon transfer equation. 

The donor pool-dominated carbon transfer, then, is one 
of the four fundamental properties that govern 
dynamics of the land carbon cycle. The other three 
properties are (1) photosynthesis as the primary 
carbon influx pathway, (2) compartmentalization 
of carbon processes into plant, litter, and soil, and 
(3) the first-order kinetics of carbon transfer from 
the donor pool (Luo and Weng 2011) (Equation 
1.1). Among the four properties, the donor pool- 
dominated carbon transfer is the most important property 
in determining the trajectory of the land carbon 
cycle (Luo, Keenan, and Smith 2015). If this prop- 
erty is altered in our model, the carbon cycle will 
not behave as we have observed in the real world. 

The four properties fundamentally character- 
ize the internal processes of the land carbon cycle. 
The internal processes drive the land carbon cycle 
to converge toward an attractor (Luo et al. 2017). 
This is the reason why places like Manhattan 
could become deciduous forests and Lujiazui of
Shanghai could become lowland forests if human
disturbances were removed. This convergence is 
applicable to almost any place on Earth. 

Active research is going on to incorporate 
microbial processes and traits into carbon cycle 
models to account for the important role of 
microbes in catalyzing decomposition of soil 
organic matter. Many of the microbial models were
developed because their authors suspected that
models based on first-order kinetics of carbon
transfer from donor pools were too simple.
However, these newly developed,
microbially-based models are usually nonlinear
and do not fit the observed macroscopic patterns
of litter and SOC decomposition well (e.g., Liang
et al. 2018). It remains a challenge to develop litter and
SOC decomposition models that both adequately
represent microbial processes and are consistent
with the macroscopic patterns observed in
almost all experimental studies.

Next, we examine how well these four proper- 
ties are represented in models. 


THE MATRIX APPROACH TO MODEL 
REPRESENTATION OF THE LAND CARBON 
CYCLE 


The four properties identified above are all well 
represented in models as long as the models use 
a so-called pool-and-flux or box-and-arrow struc- 
ture. In this structure, we use pools to represent 
different carbon compartments and fluxes to rep- 
resent carbon transfer among compartments or 
carbon input into and output out of the ecosystem. 
For this structure, we need one carbon balance 
equation to trace how much carbon gets into one 
pool and how much carbon leaves the pool. For 
example, the leaf pool, X_1, receives carbon from
photosynthesis partitioned to leaves and loses car- 
bon by senescence (Figure 1.3). Thus, we need 
one equation to calculate the amount of carbon 
resulting from photosynthesis and the amount of 
carbon lost to litterfall. The equation to describe 
the dynamics of carbon balance in the leaf pool 
over time can be represented by: 


$$\frac{dX_1(t)}{dt} = b_1 u(t) - k_1 \xi(t) X_1(t) \tag{1.3}$$

where u(t) represents the amount of carbon input
from net primary production (NPP, photosynthesis
minus autotrophic respiration), b_1 is the carbon
partitioning from NPP to leaves, k_1 is the rate of
senescence, and ξ(t) is an environmental modifier.
Thus, the change in the carbon pool size in
the leaf pool (dX_1(t)/dt) equals carbon input to the
leaf pool, b_1 u(t), minus carbon leaving the leaf pool,
ξ(t) k_1 X_1(t).

Figure 1.3. A generalized matrix model of the terrestrial carbon cycle. (a) The basic carbon cycle processes are represented
by four fundamental properties for all terrestrial ecosystems. (b) The four properties have been incorporated into terrestrial
carbon cycle models with a pool-and-flux structure. (c) The structure is typically encoded using a set of balance equations with
carbon input into and output from each pool. (d) The balance equations of terrestrial carbon cycle models can be converted
to a matrix equation. Thus, the matrix equation can be considered as a general system equation (or a dynamical equation) for
the terrestrial carbon cycle.

Extending this idea to all the eight pools of
the Terrestrial Ecosystem (TECO) model, we have
eight carbon balance equations to track carbon
cycling in the ecosystem as:

$$
\begin{aligned}
\frac{dX_1(t)}{dt} &= b_1 u(t) - k_1 \xi(t) X_1(t) && \text{(plant)}\\
\frac{dX_2(t)}{dt} &= b_2 u(t) - k_2 \xi(t) X_2(t) && \text{(plant)}\\
\frac{dX_3(t)}{dt} &= b_3 u(t) - k_3 \xi(t) X_3(t) && \text{(plant)}\\
\frac{dX_4(t)}{dt} &= \xi(t)\left[a_{41} k_1 X_1(t) + a_{42} k_2 X_2(t) - k_4 X_4(t)\right] && \text{(litter)}\\
\frac{dX_5(t)}{dt} &= \xi(t)\left[a_{51} k_1 X_1(t) + a_{52} k_2 X_2(t) + k_3 X_3(t) - k_5 X_5(t)\right] && \text{(litter)}\\
\frac{dX_6(t)}{dt} &= \xi(t)\left[a_{64} k_4 X_4(t) + a_{65} k_5 X_5(t) + a_{67} k_7 X_7(t) + a_{68} k_8 X_8(t) - k_6 X_6(t)\right] && \text{(SOM)}\\
\frac{dX_7(t)}{dt} &= \xi(t)\left[a_{75} k_5 X_5(t) + a_{76} k_6 X_6(t) - k_7 X_7(t)\right] && \text{(SOM)}\\
\frac{dX_8(t)}{dt} &= \xi(t)\left[a_{86} k_6 X_6(t) + a_{87} k_7 X_7(t) - k_8 X_8(t)\right] && \text{(SOM)}
\end{aligned}
\tag{1.4}
$$

where X_i, i = 1, 2, ..., 8, is the amount of carbon,
respectively, in three plant, two litter, and three
SOC pools; b_1, b_2, and b_3 are plant partitioning
coefficients to leaf, root, and wood; k_i, i = 1, 2, ..., 8,
is the rate of carbon leaving the eight pools (i.e.,
process rate or exit rate); a_ij, i = 1, 2, ..., 8, j = 1,
2, ..., 8, is the transfer coefficient of carbon from
pool j to pool i; and ξ(t) is the environmental modifier.
The eight carbon balance equations in Equation
1.4 can be reorganized into a matrix form as:

$$
\begin{pmatrix}
dX_1(t)/dt \\ dX_2(t)/dt \\ dX_3(t)/dt \\ dX_4(t)/dt \\ dX_5(t)/dt \\ dX_6(t)/dt \\ dX_7(t)/dt \\ dX_8(t)/dt
\end{pmatrix}
=
\begin{pmatrix}
b_1 \\ b_2 \\ b_3 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{pmatrix} u(t)
+
\begin{pmatrix}
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
a_{41} & a_{42} & 0 & -1 & 0 & 0 & 0 & 0 \\
a_{51} & a_{52} & 1 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & a_{64} & a_{65} & -1 & a_{67} & a_{68} \\
0 & 0 & 0 & 0 & a_{75} & a_{76} & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & a_{86} & a_{87} & -1
\end{pmatrix}
\xi(t)\,
\operatorname{diag}(k_1, k_2, \ldots, k_8)
\begin{pmatrix}
X_1(t) \\ X_2(t) \\ X_3(t) \\ X_4(t) \\ X_5(t) \\ X_6(t) \\ X_7(t) \\ X_8(t)
\end{pmatrix}
\tag{1.5}
$$

In this equation, a_53 equals 1 as all the carbon
from the wood pool goes to the structural litter
pool. The above matrix equation can be succinctly
expressed as:

$$\frac{dX(t)}{dt} = B\,u(t) + A\,\xi(t)\,K\,X(t) \tag{1.6}$$

where X(t) is a vector of pool sizes, B is a vector
of partitioning coefficients from carbon input to
each of the pools, u(t) is the carbon input rate,
A is a matrix with -1 in the diagonal and transfer
coefficients in the off-diagonal to quantify
carbon movement along the pathways (equivalent
to microbial carbon use efficiencies for carbon
transfer among soil pools), K is a diagonal
matrix of process rates (mortality for plant pools
and decomposition coefficients for litter and soil
pools) from donor pools, and ξ(t) can be a scalar or
a diagonal matrix of environmental modifiers to
represent responses of the carbon cycle to changes
in temperature, moisture, and oxygen. When ξ(t)
is a diagonal matrix, the environmental modifiers
can be the same for all the pools or different for
individual pools.

Equation 1.6 is generalizable to represent land 
carbon cycle models that follow first-order kinet- 
ics. It describes net carbon pool change, dX/dt, as a 
difference between carbon input u(t), distributed
to different plant pools via partitioning coefficients
B, and carbon loss through the transformation
matrices (Aξ(t)K) among individual pools X(t).
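
The matrix form lends itself directly to code. The following is a minimal sketch of an eight-pool model with the same pool-and-flux layout as the TECO example, assuming constant carbon input and a scalar environmental modifier. All parameter values, and the helper names `dXdt`, `steady_state` and `simulate`, are hypothetical choices for illustration, not values or code from TECO or CarboTrain. Setting dX/dt = 0 in Equation 1.6 gives the steady-state pool sizes X_ss = -(A ξ K)^(-1) B u, which the sketch compares with a forward-Euler simulation.

```python
import numpy as np

# Hypothetical parameters. Pool order: leaf, root, wood, metabolic litter,
# structural litter, fast SOM, slow SOM, passive SOM.
B = np.array([0.3, 0.3, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0])      # partitioning of input
k = np.array([1.0, 0.8, 0.03, 4.0, 0.5, 2.0, 0.05, 0.002])  # exit rates (1/yr)
K = np.diag(k)

A = -np.eye(8)                         # -1 on the diagonal
A[3, 0], A[4, 0] = 0.7, 0.3            # a41, a51: leaf to litter pools
A[3, 1], A[4, 1] = 0.6, 0.4            # a42, a52: root to litter pools
A[4, 2] = 1.0                          # a53: all wood carbon to structural litter
A[5, 3] = 0.45                         # a64
A[5, 4], A[6, 4] = 0.25, 0.30          # a65, a75
A[6, 5], A[7, 5] = 0.30, 0.02          # a76, a86
A[5, 6], A[7, 6] = 0.40, 0.02          # a67, a87
A[5, 7] = 0.45                         # a68

def dXdt(X, u, xi):
    """Right-hand side of the matrix equation: B u + A xi K X."""
    return B * u + A @ (xi * K) @ X

def steady_state(u, xi):
    """Pool sizes at which dX/dt = 0 for constant u and scalar xi."""
    return -np.linalg.solve(A @ (xi * K), B * u)

def simulate(X0, u, xi, years, dt=0.01):
    """Forward-Euler integration with constant forcing (time step in years)."""
    M = A @ (xi * K)
    X = X0.copy()
    for _ in range(int(round(years / dt))):
        X = X + dt * (B * u + M @ X)
    return X

u, xi = 1.0, 1.0                       # assumed constant input and modifier
X_ss = steady_state(u, xi)
X_end = simulate(0.5 * X_ss, u, xi, years=200)
print("steady-state pools:", np.round(X_ss, 2))
print("pools after 200 years:", np.round(X_end, 2))
```

Starting from half the steady-state sizes, the fast pools approach their steady-state values within a few years while the passive SOM pool is still far from it after 200 years, which hints at why solving for the steady state analytically can speed up spin-up.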

As Equation 1.6 is generic to unify the land car- 
bon cycle models that follow first-order kinetics, 
any single model is a special case of the generic 
equation. For example, the Community Land 
Model version 4.5 (CLM4.5) incorporates carbon 
transfer among seven pools per soil layer over 10 
layers (Koven et al. 2013). The seven pools in each 
layer are metabolic litter, cellulose litter, lignin lit- 
ter, coarse woody debris, fast soil organic matter, 
slow soil organic matter, and passive soil organic 
matter. Carbon vertical transfers mainly occur
between one layer and the next layer down. Huang 
et al. (2018b) converted the soil carbon module of 
CLM4.5 into one matrix equation as: 


$$\frac{dX(t)}{dt} = B(t)\,u(t) + A\,\xi(t)\,K\,X(t) + V(t)\,X(t) \tag{1.7}$$


The matrix equation above has three items on 
the right side, i.e., carbon input and plant parti- 
tioning, carbon transfers and release, and vertical 
movement. The dimension of the matrix equation 
is 70 for 70 pools. Transfer matrix A, for example, 
is a 7×7 block matrix. Each element of the block
matrix, V(t), is a 10×10 matrix. The vertical movement
matrix includes carbon transferring to the
next layer up and the next layer down. 

Equation 1.7 shares similar mathematical properties
with Equation 1.6. Equation 1.7 has 70 dimensions
whereas Equation 1.6 has eight dimensions.
Equation 1.7 describes carbon transfers among
70 pools over 10 soil layers whereas Equation 1.6
describes carbon transfers among eight pools, 
including plant, litter, and soil pools without 
explicitly defined soil layers. 

Lu et al. (2020) converted the CLM5 carbon 
and nitrogen cycles into four matrix equations, 
two for vegetation carbon and nitrogen cycles 
and two for soil carbon and nitrogen cycles (see 
Chapter 6). The vegetation carbon cycle in CLM5, 
for example, contains 18 pools, including six tis- 
sue pools: leaf, fine root, live stem, dead stem, live 
coarse root, and dead coarse root. Each tissue pool 
is accompanied by a storage pool and a transfer 
pool. A crop grain tissue pool, accompanied by 
a grain storage pool and a grain transfer pool, is 
added when the crop model is used. Vegetation 
carbon dynamics are controlled by phenology for 
the leaf onset and offset dates according to air tem- 
perature and soil water conditions. Harvest and fire 
remove part of plant carbon from ecosystems and 
part goes to litter pools according to the harvest 
rate. Natural death moves carbon from plant pools 
to litter pools at defined mortality rates. The fire 
module triggers the occurrence of occasional fire 
events based on the amount of the fuel (i.e., litter) 
and the soil moisture. When fire occurs, carbon 
in plant pools is partly released to the atmosphere 
and partly transferred to the litter pools based on 
their tissue quality and burned area. Thus, the vegetation
carbon dynamics are represented by the
following matrix equation: 


$$\frac{dX_v(t)}{dt} = B\,u(t) + \left(A_{phc}(t)\,K_{phc} + A_{gmc}(t)\,K_{gmc} + A_{fic}(t)\,K_{fic}\right) X_v(t) \tag{1.8}$$


Here, X_v is an n-entry vector, representing vegetation
carbon pool size. K is an n × n diagonal
matrix. Its subscripts phc, gmc, and fic indicate carbon
processes related to phenology, gap mortality
(i.e., harvest from land use and natural mortality),
and fire, respectively. The diagonal entries are the
process (or exit) rates of all vegetation carbon
pools due to phenology (K_phc), gap mortality (K_gmc),
and fire (K_fic). Once converted, the matrix equations
make CLM5 more modular, analytically clear,
easily diagnosed, and computationally more effi- 
cient for spin-up. 

Equation 1.8 similarly shares mathematical properties
with Equation 1.6. Equation 1.8 uses three
processes, phenology, mortality, and fire, to control
carbon transfers among 18 vegetation pools
whereas Equation 1.6 only uses the environmental
scalar ξ(t) to control carbon transfer.

Some other global models, such as CABLE, LPJ- 
GUESS and ORCHIDEE, have also been converted 
to matrix equations. Once a model is converted to 
a matrix equation, it appears quite simple: a first-order
differential equation. In fact, the equation is
not that simple as it represents a nonautonomous 
system, which is discussed below. With this uni- 
fied matrix equation, we can explore the gen- 
eral properties of carbon models and the general 
behavior of the land carbon cycle. This is what we 
call the matrix approach. 

Note that matrix models themselves do not
represent a significant advance in carbon cycle
research. The matrix models are merely used to 
represent a set of differential equations in a matrix 
form to describe carbon cycling among multiple 
pools of the earth system (Bolin and Eriksson 
1958). What we propose is to take the matrix 
expression to unify the land carbon cycle models, 
enable data assimilation, and develop a theoretical 
framework to guide carbon research. 

The matrix representation can be adapted to 
accommodate nonlinear dynamics. For instance, 
Equation 1.6 can be modified for a nonlinear 
microbial model as: 


$$\frac{dX(t)}{dt} = B(X,t)\,u(X,t) + A(X,t)\,\xi(t)\,K(X,t)\,X(t) \tag{1.9}$$

Thus, carbon input u, plant carbon partitioning
B, transfer coefficients A, and process rates K
all are potentially functions of pool sizes X. Carlos 
Sierra's group from the Max Planck Institute for 
Biogeochemistry in Germany have converted many 
nonlinear microbial models into matrix equations 
and explored their general behavior (Sierra and
Müller 2015; Metzler, Müller, and Sierra 2018).


THE PARADOX OF THE MATRIX EQUATION AND 
NONAUTONOMOUS SYSTEMS 


While we argue that all the land carbon cycle mod- 
els that follow first-order kinetics can be converted 
into a unified matrix form, the general equation is 
mathematically extremely simple (e.g., Equation
1.6). A paradox arises: how can such a simple equation
represent the extremely complex phenomena
of the carbon cycle observed in the real world? 

To explore this paradox, we organized a workshop
in 2012, sponsored by the US National Institute
for Mathematical and Biological Synthesis
(NIMBioS). We invited 20 applied mathematicians
and 20 ecologists to explore the paradox: why is 
the matrix equation (Equation 1.6) extremely sim- 
ple whereas carbon cycle phenomena observed in 
the real world can be very complex? We presented 
this paradox to the workshop participants. In the 
first couple of days, it was very difficult to con- 
vince the group about this issue. Some ecologists 
said that ecosystems are complex and cannot be 
described by such a simple equation. They urged 
us to re-examine this mathematical expression. 
We told them that we had used thousands of data
sets to verify the equation. The applied mathematicians
told us that this equation was too simple to be
interesting enough for them to study.
They suggested that we add a nonlinear term in 
the equation. We told them that if a nonlinear term 
is added to the equation, it no longer describes 
the land carbon cycle as revealed by data. We were 
stubborn and kept pointing out the paradox. After 
one and a half days, Dr. James Cushing, an applied 
mathematician from the University of Arizona, 
told us that the system we were studying is prob- 
ably a nonautonomous system. 

Nonautonomous systems have been a subject of
mathematical research in recent decades. They are
dynamical systems whose inputs and parameters
are time dependent. The matrix equation of the
land carbon cycle as expressed by Equation 1.6
does indeed have time-dependent inputs and
parameters.

With this new insight, we organized a working
group to study the nonautonomous system,
again sponsored by NIMBioS. The working
group consisted of a few applied mathematicians 
and a few ecologists. One of the group members, 
Dr. Martin Rasmussen, a professor from Imperial 
College London, had coauthored two books on 
nonautonomous systems. He taught us a lot about 
how to study nonautonomous systems. We worked 
together over three years and had four meetings. 

The working group arrived at a few key findings.
First, it found that nonlinear models of soil carbon 
decomposition (e.g. Equation 1.9) generate unre- 
alistic responses to small perturbations and car- 
bon input (Wang et al. 2014, Wang et al. 2016). A 
stability analysis and numerical simulations were 
conducted for two nonlinear microbial models 
(a two-pool model and a three-pool model) of 
soil carbon decomposition. Both models exhibit
damped oscillatory responses to small perturbations.
In addition, the equilibrium pool sizes of
litter or soil carbon are insensitive to carbon inputs 
in the nonlinear microbial models. This oscillatory 
behavior and insensitivity of soil carbon to carbon 
input exhibited by the nonlinear models have not 
been observed in the real world. 
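
Both behaviors can be checked on a small example. The sketch below assumes a generic two-pool microbial model (soil carbon C and microbial biomass B) with Michaelis-Menten uptake, dC/dt = I - V·B·C/(K + C) + μ·B and dB/dt = ε·V·B·C/(K + C) - μ·B. This model form and every parameter value are illustrative assumptions, not the exact models or parameters analysed by Wang et al.

```python
import cmath

# Hypothetical parameter values, chosen only for illustration.
V = 4.0     # maximum uptake rate per unit microbial biomass
K = 100.0   # half-saturation constant for uptake
EPS = 0.5   # microbial carbon use efficiency
MU = 1.0    # microbial turnover rate


def equilibrium(carbon_input):
    """Return the steady state (C*, B*) of the assumed two-pool model."""
    c_star = MU * K / (EPS * V - MU)                   # does not contain the input
    b_star = EPS * carbon_input / (MU * (1.0 - EPS))   # proportional to the input
    return c_star, b_star


def jacobian_eigenvalues(carbon_input):
    """Eigenvalues of the 2x2 Jacobian evaluated at the steady state."""
    c_star, b_star = equilibrium(carbon_input)
    a = b_star * V * K / (K + c_star) ** 2   # B* times the slope of the uptake curve
    trace = -a
    det = MU * (1.0 - EPS) * a
    root = cmath.sqrt(trace ** 2 - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


for carbon_input in (10.0, 20.0):
    print(carbon_input, equilibrium(carbon_input), jacobian_eigenvalues(carbon_input))
```

With these assumed values, doubling the input leaves the equilibrium soil carbon C* unchanged and only raises microbial biomass, while the eigenvalues form a complex pair with negative real part, which is the signature of a damped oscillation around the steady state.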

Second, we identified a mathematical founda- 
tion through proof of theorems on exponential 
stability to explain observed convergence of the 
land carbon cycle (Rasmussen et al. 2016). The 
land carbon cycle can be considered as a linear 
nonautonomous compartmental system described 
by a dynamical equation (i.e., Equation 1.6). The 
equation can be rewritten as:

$$\frac{dX(t)}{dt} = B\,u(t) + G(t)\,X(t) \qquad (1.10)$$

where

$$G(t) = A\,\xi(t)\,K$$

Matrix $G$ is invertible. The entries $g_{ij}(t)$ of $G$
satisfy three conditions. First, there is always carbon
to leave any individual pool, as $g_{ii}(t) < 0$ for all $i$.
Second, there is either no carbon flow pathway or
some amount of carbon moving from pool $j$ to
pool $i$ (i.e., $g_{ij}(t) \ge 0$ for all $i \ne j$). Third, carbon
exiting pool $j$ is either all going to other pools or
some going to other pools and some being lost
from the system via respiration (i.e.,
$\sum_{i=1}^{d} g_{ij}(t) \le 0$ for all $j$, where $d$ is the
number of pools). As the land carbon cycle satisfies the three
conditions, the system converges exponentially
to a time-dependent attractor, which we define as
carbon storage capacity. The storage capacity in a 
nonautonomous system is not a static concept but 
varies with time due to the time-dependence of 
carbon input and residence time. 
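
The three conditions on the entries of G can be checked mechanically for any candidate model structure. The following is a minimal sketch, assuming a scalar environmental modifier and a hypothetical three-pool transfer matrix; the function names and all numbers are illustrative and not taken from any published model.

```python
def build_g(a_matrix, rates, xi):
    """Return G = A * xi * K, with K = diag(rates) and xi a scalar."""
    n = len(rates)
    return [[a_matrix[i][j] * xi * rates[j] for j in range(n)] for i in range(n)]


def satisfies_conditions(g, tol=1e-12):
    """Check the three conditions on the entries of G."""
    n = len(g)
    # 1. Carbon always leaves every pool: negative diagonal.
    diagonal_negative = all(g[i][i] < 0 for i in range(n))
    # 2. Transfers from pool j to pool i are zero or positive.
    off_diagonal_nonnegative = all(
        g[i][j] >= 0 for i in range(n) for j in range(n) if i != j
    )
    # 3. Carbon leaving pool j goes to other pools or is respired: column sums <= 0.
    column_sums_nonpositive = all(
        sum(g[i][j] for i in range(n)) <= tol for j in range(n)
    )
    return diagonal_negative and off_diagonal_nonnegative and column_sums_nonpositive


# Hypothetical three-pool example: a_matrix[i][j] is the fraction of the
# carbon leaving pool j that enters pool i; the diagonal is -1.
A_VALID = [[-1.0, 0.0, 0.0],
           [0.4, -1.0, 0.1],
           [0.05, 0.3, -1.0]]
RATES = [1.0, 0.5, 0.01]   # per-year turnover rates (illustrative)

# Same structure, but pool 0 would pass on 120% of what it loses.
A_INVALID = [[-1.0, 0.0, 0.0],
             [1.2, -1.0, 0.1],
             [0.0, 0.3, -1.0]]

print(satisfies_conditions(build_g(A_VALID, RATES, xi=0.8)))
print(satisfies_conditions(build_g(A_INVALID, RATES, xi=0.8)))
```

The second matrix fails the third condition because more carbon would arrive in downstream pools than left the donor pool, which violates mass balance.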

The working group also found a mathematical 
representation for a transient dynamical equation 
of the land carbon cycle (Luo et al. 2017), which 
is discussed below. 


PREDICTABILITY OF THE LAND CARBON CYCLE 


At the very beginning of this chapter, we showed
how “Mother Nature” drives the land carbon cycle
to converge toward some attractor states. This
“Mother Nature” is the exponential stability of the
carbon cycle equation (Rasmussen et al. 2016). The
certainty that any ecosystem converges to some
attractor state after human and natural disturbances
reflects the high predictability of the land
carbon cycle (Figure 1.4).

Given one type of forcing, patterns of carbon 
cycle dynamics are usually highly predictable 
(Luo et al. 2015). For example, periodic climate 
forcing over seasons usually influences the car- 
bon cycle system to generate periodicity in car- 
bon fluxes. Similarly, one fire disturbance event 
generates a pulse release of carbon from land to 


External forcing 


Periodic climate 
(e.g., seasonal) 


Disturbance event 

(e.g., fire) 

dX(t) 
dt 


Climate change 
(e.g., rising CO2) 


Disturbance regime 


Ecosystem state change 
(e.g., tipping point) 


System equation 


= Bu(t) + A&(t)KX(t) 


the atmosphere followed by gradual recovery. The 
gradual recovery is well illustrated by vegetation 
overrunning the deserted village in China (Figure 
1.1) and thriving young lodgepole pine trees 
recolonizing the burned landscape after the 1988 
Yellowstone fires (Turner 2018) as described in 
the first section of this chapter. 
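
The pulse-recovery pattern can be illustrated with the simplest possible case. The sketch below assumes a single carbon pool with constant input u and constant turnover rate k, so that dX/dt = u - kX, a one-pool simplification of Equation 1.6. The function name and the numbers are hypothetical.

```python
import math


def pool_after_disturbance(years, u, k, fraction_lost):
    """Pool size `years` after a pulse disturbance, for dX/dt = u - k*X.

    The pool starts at its attractor u/k, loses `fraction_lost` of its
    carbon at time zero, and then relaxes back exponentially.
    """
    attractor = u / k
    start = (1.0 - fraction_lost) * attractor
    return attractor + (start - attractor) * math.exp(-k * years)


# Illustrative values: input of 2 units per year, turnover of 0.1 per year,
# and a fire that removes half of the stored carbon.
for years in (0, 10, 50, 200):
    print(years, round(pool_after_disturbance(years, u=2.0, k=0.1, fraction_lost=0.5), 3))
```

Whatever fraction the disturbance removes, the pool returns toward the same attractor u/k, and the speed of recovery is set by k alone.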

Global change factors, such as rising atmospheric
CO2 concentration and warming, usually
generate gradual but directional changes in car- 
bon cycle processes. Shifts in fire regimes usu- 
ally lead to disequilibrium in the carbon cycle. 
And ecosystem state changes, due, for example, to 
land use change from forests to croplands, usually 
result in abrupt changes in ecosystem properties, 
such as photosynthetic capacity and carbon processes
that depend on it. While many processes of
the carbon cycle are highly predictable, precisely
predicting carbon storage changes under climate
change requires large amounts of data to constrain
model parameterization.


Figure 1.4. Predictability of the terrestrial carbon cycle. Responses of the carbon cycle to a given type of external forcing are
highly predictable. [Diagram: external forcing (periodic climate, e.g., seasonal; disturbance event, e.g., fire; climate change,
e.g., rising CO2; disturbance regime; ecosystem state change, e.g., tipping point) acts on the system equation
$dX(t)/dt = Bu(t) + A\xi(t)KX(t)$ to produce the response (periodicity; pulse-recovery; gradual change; disequilibrium;
abrupt change).]


DYNAMIC DISEQUILIBRIUM OF LAND CARBON
CYCLE


It is well known that the carbon cycle dynamics at
steady state can be described by two terms: carbon
input and residence time. The product of these
two terms determines the carbon storage capacity.
However, directional climate change pushes
the land carbon cycle out of steady state towards
dynamic disequilibrium (Luo and Weng 2011). To
quantify the dynamic disequilibrium, we need a
third term, in addition to carbon input and residence
time, to represent the transient dynamics
of the land carbon cycle. The working group
supported by NIMBioS worked very hard to find a
mathematical expression of the transient dynamics.
We derived a variety of mathematical formulations
for the transient dynamics of the carbon
cycle. One day we came up with the third term
to describe the disequilibrium of the land carbon
cycle. The original derivation was very complicated.
After a few rounds of re-organization, we
can now explain the mathematical derivation in a
very simple way. Basically, we can multiply each
term of Equation 1.6 by the inverse matrix of
$A\xi(t)K$, $(A\xi(t)K)^{-1}$, and then re-organize the
equation to get an equation of the form:

$$X(t) = \left(-A\xi(t)K\right)^{-1} B\,u(t) - \left(-A\xi(t)K\right)^{-1} \frac{dX(t)}{dt} \qquad (1.11)$$

Then, the transient dynamics of the land carbon
cycle can be expressed by:

$$X(t) = \tau_E\,u(t) - X_p(t) \qquad (1.12)$$

where $\tau_E$ is ecosystem residence time as defined by

$$\tau_E = \left(-A\xi(t)K\right)^{-1} B \qquad (1.13)$$

and $X_p$ is carbon storage potential. It represents the
disequilibrium term of the carbon cycle as:

$$X_p(t) = \left(-A\xi(t)K\right)^{-1} \frac{dX(t)}{dt} \qquad (1.14)$$

The first term on the right side of Equation 1.12
is the carbon storage capacity $X_c(t)$ as:

$$X_c(t) = \tau_E\,u(t)$$

This transient equation is also the dynamical
equation of the land carbon cycle. The term
dynamical equation is often used in mathematics
to describe how the state of a system changes
over time. Although the formulation of the carbon
dynamical equation is simple, it represents
a nonautonomous system, which is influenced
by five categories of external forcing variables 
(i.e., periodic climate, disturbance event, distur- 
bance regime shift, climate change, and ecosys- 
tem state change) (Figure 1.4) (Luo et al. 2017). 
Those external forcing variables are superimposed 
on each other to influence carbon cycle dynam- 
ics, generating complex phenomena. That is why 
we observe complex phenomena of carbon cycle 
dynamics in the real world. Behind the phenom- 
ena is a relatively simple dynamical equation to 
govern the trajectory of the carbon cycle. 
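
The split of carbon storage into a capacity and a potential can be verified numerically. The sketch below assumes a hypothetical two-pool system with a scalar environmental modifier. It evaluates dX/dt from Equation 1.6, computes the storage capacity and storage potential from the inverse of the negated matrix, and confirms that their difference returns the current pool sizes. All names and values are illustrative.

```python
def mat_vec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def inverse_2x2(m):
    """Invert a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def decompose(x, u, b_vec, a_matrix, rates, xi):
    """Return (X_c, X_p) for a two-pool system at one instant.

    X_c = (-A xi K)^-1 B u      is the carbon storage capacity.
    X_p = (-A xi K)^-1 dX/dt    is the carbon storage potential.
    """
    g = [[a_matrix[i][j] * xi * rates[j] for j in range(2)] for i in range(2)]
    gx = mat_vec(g, x)
    dxdt = [b_vec[i] * u + gx[i] for i in range(2)]   # Equation 1.6
    neg_g_inv = inverse_2x2([[-value for value in row] for row in g])
    x_c = mat_vec(neg_g_inv, [b_vec[i] * u for i in range(2)])
    x_p = mat_vec(neg_g_inv, dxdt)
    return x_c, x_p


# Hypothetical values: all input enters pool 0, and 30% of the carbon
# leaving pool 0 is transferred to the slower pool 1.
X_NOW = [1.0, 5.0]
x_c, x_p = decompose(X_NOW, u=2.0, b_vec=[1.0, 0.0],
                     a_matrix=[[-1.0, 0.0], [0.3, -1.0]],
                     rates=[0.5, 0.02], xi=1.0)
print("capacity:", x_c)
print("potential:", x_p)
print("capacity - potential:", [x_c[i] - x_p[i] for i in range(2)])
```

In this example both pools sit well below their capacity, so the storage potential is positive and the system is still accumulating carbon.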

In the coming five units of this book, we will show 
you how the matrix approach can unify land carbon 
cycle models, accelerate spin-up, offer new analytics 
for model diagnostics, and facilitate data assimilation 
to improve the fit of models to observations. 


SUGGESTED READING 


Luo YQ, Weng ES. (2011) Dynamic disequilibrium of 
the terrestrial carbon cycle under global change. 
Trends in Ecology & Evolution, 26, 96-104.


QUIZZES 


1. What are the four fundamental properties of 
internal land carbon cycle processes? 


2. What are the five categories of external forcing 
to influence the carbon cycle? 
3. Briefly describe the donor-pool dominated carbon
transfer.


4. How do external forcing variables interact with 
internal processes to create the complex phe- 
nomena of the land carbon cycle? 
This chapter introduces the concept of a model,
and its role within modern research methodology 
and the scientific method. Some typical characteris- 
tics of the system dynamics models used in carbon 
cycle studies are described, alongside examples of 
different ways they can be applied in ecosystem 
and Earth system research. With reference to in- 
depth discussion and examples throughout the 
book, we introduce the workflow of six steps you 
would typically follow when integrating a model 
within a robust research study design. 


WHAT IS A MODEL?


The dictionary definition of a model is an ideal- 
ized, simplified or down-sized representation of something, 
the purpose of which is to describe, explain or depict
the thing that it represents, which
we may refer to as its object. The model encodes
information about certain, well-chosen aspects 
of the object, and its purpose is to convey that
information to the viewer or user of the model 
in a clear, concise, and potentially useful way. 
Compared to the phenomenon or notion it rep- 
resents, the model may constitute a simpler, more 
lucid, or even an exaggerated representation of 
those aspects it is designed to convey — and that is 
exactly the purpose (Figure 2.1). Models induce 
us to perceive — and may sometimes help us to 
understand — something significant about the 
object by discarding extraneous detail and focus- 
ing on the essential. Thus, models are designed to 
simplify, explain, and communicate, and these aspects are 
interdependent: a simple explanation of a com- 
plex idea makes it easier to understand and convey 
to others. Similarly, models are used in learning, 
research, and academic exchange as a way of sim- 
plifying, synthesizing, and communicating knowl- 
edge, data and ideas, allowing us to put them into 
practice in useful ways. 
Figure 2.1. Models are common in everyday life, and serve a similar purpose as in learning, research, and academic
interchange, namely to simplify complex information and facilitate communication and understanding.


MODELS IN RESEARCH


Models in a broad sense are fundamental to mod- 
ern research methodology. In its simplest form, 
a model may be a mental construct describing a 
perceived pattern or relationship; for example, the 
simple observation that an increase in one variable 
tends to be consistently associated with an increase 
or decrease in another. The existence of patterns 
and relationships in nature has surely been noted 
by people throughout time. The fact that certain 
types of edible plants were often found growing in 
a particular physical situation, or associated with 
certain conspicuous species, would have been 
important practical knowledge in hunter-gatherer 
societies. From the beginnings of agriculture, 
farmers have taken note of environmental events 
that signaled propitious conditions for sowing and 
harvesting their crops, or that warned of hazards 
such as droughts, freezing conditions, or insect 
plagues. Philosophers of ancient Greece pioneered 
the documentation of patterns and relationships in 
nature in a systematic way, reasoning about their 
wider significance. In modern terminology, they 
could be said to be the first to employ modeling 
as part of the research process. The ‘models’ of 
this era were qualitative, mental models, describ- 
ing and potentially suggesting explanations for 
observed patterns and relationships, and expressed 
in words, illustrations, or allegories. 

Today, models, whether mental, conceptual, or 
mathematically formalized (Figure 2.2), feature 
in most scientific research. Oreskes (1994) stated: 
“Numerical models are a form of highly complex 
scientific hypothesis”. The hypothesis, of course, 
is an element or ‘step’ in the empirical scientific 
method, pioneered by Galileo in the 17th century. 
The core of this method is the experiment, by which
observations from the real world are collected and
examined in such a way as to shed light on the valid- 
ity of a hypothesis. Often, several or many empirical 
studies addressing the same or related hypotheses 
are needed before scientists achieve consensus and 
the former hypothesis becomes an element of a 
consensus theory or ‘settled science’. The hypoth- 
esis behind a given study is rarely entirely novel, 
but represents a step beyond the existing frontier 
of knowledge in the research area or field. We can 
think of the hypothesis as an ‘informed guess’ given 
what we already know thanks to the consensus 
knowledge and theory of the field, accrued over 
earlier studies and scientific discourse. 

Similarly, the models we commonly use in
research typically bring together
elements of established knowledge with ideas or 
informed guesses about things we do not yet know 
with certainty, or lack data to express in a precise, 
quantitative way. In this sense, a model provides 
a context or frame for integrating knowledge 
and observations to pose questions about the real 
world in the form of a testable hypothesis. A model 
is not necessarily ‘true’ or proven, but can stand 
as a formalized or explicit hypothesis of how the 
system under study works. 

Many different kinds of models are applied 
within the environmental and earth sciences. This 
book focuses mainly on numerical simulation 
models, implemented as software algorithms and 
executed on a computer. A familiar class of simula- 
tion models are the numerical weather prediction 
(NWP) models used by weather service agencies 
(like the US National Weather Service) to produce 
daily and longer-term weather forecasts. The
(relative) accuracy of these models' predictions
depends on knowledge of the physical processes
that govern the dynamics of weather, and
on extensive observations (such as measurements
from weather stations, balloons and satellites)
both to “train” the model, improving its fit with
the measurements, and to fix the state or “initial
conditions”. Integrating observations in this way is
called ‘constraining’ the model.

Figure 2.2. Three increasingly formal and precisely specified models that express a common hypothesis.

NWP models are an example of the class of mod- 
els denoted as mechanistic or process-based. Many 
of the core equations of an NWP model express 
known physical relationships or laws governing 
energy balance and motion, applied to the three- 
dimensional atmosphere. Others, such as equations 
governing the formation and behavior of clouds, 
are parameterized, i.e., fitted to observations, but 
informed by physics. This combination of mechanistic
(process) knowledge and empirical fitting is
characteristic of most process-based models.


WAYS OF USING MODELS 


Perhaps because weather prediction models and 
forecasts are so familiar from daily life, many 
people, including scientists who do not habitually 
work with numerical models as part of their meth- 
odological toolbox, tend to think of such models 
primarily as tools for prediction forward in time. 
Certainly, extrapolating beyond the range of obser- 
vations (forward in time, or in any temporal or 
spatial context for which observations are sparse 
or lacking) can be one useful way to apply some 
types of models (though not all models are suitable 
for this). However, this is far from being the only 
way modeling can form part of research methodology,
and is perhaps the one with the least relevance to
science's central endeavor to advance fundamental 
knowledge about the natural world. In general, the 
potential for new discoveries is greatest at the inter- 
face between modeling and observation, where 
information on real-world phenomena encoded in
data meets the potential of modeling to explain or
decode patterns in the data in terms of underlying 
mechanisms of cause and effect. 

Box 2.1 shows some ways in which process-based
models of the kind covered by this book
can be applied within research on the land carbon
cycle. In general, the potential for integration
between modeling and empirical approaches
increases as you go down through the list. We will
see examples of most of these modes of applying
models in the course of this book.


BOX 2.1 Modes of model application
(potential for integration with empirical studies increases down the list)

• project future changes and impacts, extrapolate
  beyond current data (e.g., Chapter 18: C cycle
  transient responses to future climate change)

• scale up findings from local studies to regional
  or global scale (e.g., Chapter 27: model-data
  fusion of sub-continental GPP)

• characterise uncertainty, identify robust responses
  and relationships (e.g., Chapter 17: tracing
  uncertainty sources in land carbon models)

• attribute observed relationships and patterns to
  underlying drivers and mechanisms (e.g., Chapter 30:
  mechanisms controlling lake C dynamics)

• synthesise the state of knowledge, identify gaps,
  generate hypotheses to guide empirical studies
  (e.g., Chapter 33: ecological forecasting of
  field experiments)

• communicate scientific evidence to societal
  end-users and decision-makers


SYSTEM DYNAMICS


Land carbon cycle models are at their core system
dynamics models. Systems are fundamental to the
organization and processes of human society and
our daily lives. Think of a “political system”,
“educational system” or “energy system” as examples.
Fundamentally, systems are a mental framework 
through which we may view, understand or orga- 
nize complex activities or ideas that involve or 
depend on linkages between different interacting 
elements or parts. The presence of interlinked elements,
constituting a ‘whole that is greater than
the sum of its parts’, is characteristic of a system.
‘Systems thinking’ can also be applied to nature 
and has had a powerful influence on the develop- 
ment of quantitative techniques in ecology and 
numerous other fields. General systems theory, for- 
malized in the 1940s, depicts an entity under study 
as a network of interrelated elements whose prop- 
erties and linkages influence the behavior of the 
system and its evolution over time. Systems theory 
provides a framework for formalizing a conceptual 
or hypothetical understanding of an object of study 
in terms of drivers, responding processes and net- 
works of interacting elements, unified through the 
consideration of flows — for example energy, mat- 
ter, or capital — across the network. 

When we model the carbon cycle, the system in 
focus is an ecosystem, or more broadly the Earth system
(comprising the Earth’s interacting ‘spheres’:
the atmosphere, hydrosphere, biosphere etc.). The
origins of the ecosystem concept can be traced to 
the British botanist Arthur Tansley, writing in 1935,
who argued for the need to consider organisms 
and their environment as part of a unified whole 
in order to understand patterns and changes in 
nature. Tansley noted that organisms are locked into 
constant interactions with their immediate physi- 
cal environment, and that the interactions between 
individuals and species are mediated by the changes 
they impose on physical (and chemical) factors of 
the microenvironment through their growth and 
function. This notion that interactions between organisms
and the air, soil, and water in which they occur
govern pattern and process in nature is central to
systems thinking in ecology. 


TYPES OF LAND CARBON CYCLE MODELS


Today’s ecosystem and Earth system models (ESMs) 
can be considered the end branches of a family 
tree having its roots in the work of the Odum 
brothers, Eugene and Howard, and their text- 
book, Fundamentals of Ecology, first published in 1953. 
They combined concepts from system dynamics 
theory with analogies from electrical engineering 
and thermodynamics to frame ecological systems 
as networks in which interactions result from 
energy flows between trophic species levels, link- 
ing organisms with their abiotic environment via 
biological analogues of concepts from electrical 
networks such as voltage, capacitance and resis- 
tance. Modern biogeochemical models build on 
similar foundations to Howard Odum’s depiction 
of energy flow for the Silver Springs aquatic eco- 
system in Florida, USA (Odum 1971; Figure 2.3). 
Energy that enters the system as sunlight, powering 
the photosynthesis of producers (algae and aquatic 
plants), is used to drive the metabolism of the pro- 
ducers and the consumers that depend on them, 
up through the trophic chain. At each trophic step, 
energy is lost through respiration, returning heat 
to the environment. The Silver Springs ecosystem is 
here represented as a system of compartments (the 
trophic groups) that exchange energy with each 
other and with the external environment. Each 
compartment has a store of biochemical energy, 
linked to the abundance of organisms in that 
trophic group, and the size of this store changes 
over time as energy is gained and lost through 
processes like photosynthesis, herbivory, preda- 
tion and detritus production. All flows into, out 
of and between compartments of the system are 
mirrored by changes in the sizes of the compart- 
ments themselves — the system is said to uphold 
mass balance (here expressed in units of energy). 
We can state that the Silver Springs ecosystem is 
here depicted as a compartmental dynamic system. Such 
a representation has some very useful mathemati- 
cal properties for simulation and analysis of the 
system, as we shall see elsewhere in this book, and 
as discussed in detail in Chapter 7. 
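
The mass balance property of a compartmental system can be shown with a few lines of code. The sketch below assumes three compartments (producers, herbivores, carnivores) and made-up flow values in arbitrary energy units per year; the numbers are hypothetical and are not Odum's measurements for Silver Springs.

```python
def net_change(n, external_in, transfers, losses):
    """Net rate of change of each of n compartments.

    external_in[i]     flow entering compartment i from outside the system
    transfers[(i, j)]  flow from compartment i to compartment j
    losses[i]          flow leaving compartment i to the environment
    """
    change = [external_in[i] - losses[i] for i in range(n)]
    for (source, target), flow in transfers.items():
        change[source] -= flow   # the donor compartment shrinks ...
        change[target] += flow   # ... and the receiver grows by the same amount
    return change


# Hypothetical flows: 0 = producers, 1 = herbivores, 2 = carnivores.
EXTERNAL_IN = [100.0, 0.0, 0.0]             # photosynthesis feeds producers only
TRANSFERS = {(0, 1): 30.0, (1, 2): 5.0}     # herbivory, predation
LOSSES = [60.0, 20.0, 4.0]                  # respiration from each compartment

change = net_change(3, EXTERNAL_IN, TRANSFERS, LOSSES)
print("change per compartment:", change)
print("total change:", sum(change))
print("inputs minus losses:", sum(EXTERNAL_IN) - sum(LOSSES))
```

Internal transfers cancel when the compartments are summed, so the total change in storage equals what enters the system minus what leaves it.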

Figure 2.3. Energy flow diagram for the Silver Springs ecosystem in Florida. Adapted by S. Maud from Odum, H.T. (1971)
Environment, Power, and Society. Wiley-Interscience, New York.

An alternative ‘currency’ for this model, instead
of energy, would be carbon, which enters the
system as CO2 assimilated through photosynthesis,
is transferred between trophic levels through
herbivory, production of detritus, or predation,
and is lost from the system as CO2 released through
autotrophic or heterotrophic respiration (in this
aquatic system, there are also imports and exports
due to streamflow and movement of organisms
in and out of the study area). Similarly, ecosystem
models in use today typically adopt a carbon-based
representation of the ecosystem and its network
of pools and fluxes as their basic framework. An
example of such a framework was presented for
the TECO model in Chapter 1. Superimposed on
this, many models incorporate representations
of hydrological, nutrient and energy cycles that
interact with carbon cycle processes and states. In
Chapter 6 we shall see how a nitrogen cycle may
be added to the carbon-only version of TECO, to
account for progressive nitrogen limitation under
elevated atmospheric CO2.

Some of the main ‘families’ of modern land
models incorporating representations of carbon
processes are detailed in Table 2.1. These groupings
and their defining characteristics as shown in the
table are not clear-cut. On the contrary, each model
family has evolved over time as investigators have
enhanced and adapted existing tools for application
to novel research questions and management
problems. Each type of model was originally
developed with a certain research goal in mind.
For example, Soil-Vegetation-Atmosphere Transfer
(SVAT) schemes, first developed in the 1980s,
focus on land surface hydrology and energy balance,
as important exchange processes that affect
the dynamics of the atmosphere, driving weather
and climate. SVAT models like the revised version
of the Simple Biosphere Model (SiB2; Sellers et
al. 1996), which originally did not incorporate
explicit carbon processes, were extended to incorporate
a representation of canopy CO2 exchange
because an estimate of photosynthesis rate was
required as a control on stomatal conductance and
evapotranspiration, in turn affecting the partitioning
of the vegetation-atmosphere energy flux into
sensible and latent heat components. Modern land
surface models (LSMs) coupled to ESM frameworks
and used for global climate change simulations
have evolved from SVAT models, progressively
incorporating more processes and interactions,
often adopted from schemes first developed for
the other model families like forest growth models
(used to simulate the growth and dynamics of trees
and forest stands), terrestrial biogeochemistry
models (adding belowground carbon and nutrient
dynamics) and dynamic global vegetation models
(DGVMs; adding competition between plant
functional types or PFTs). In fact, current representatives
of many of the model families shown in
Table 2.1 have converged to incorporate many of
the same algorithms, functionality, and underlying
theory across model families. The frameworks that
contribute to global studies and assessments, such as
annual updates of the global carbon budget (Sitch
et al. 2015), and that are used to account for
biogeochemical and biophysical feedbacks to climate
change in IPCC climate projections, are collectively
termed terrestrial biosphere models (TBMs). Fisher et al.
(2018) provide a useful account of the current
status of, and research front for, TBMs incorporated
within ESM frameworks.


TABLE 2.1
Model types used in contemporary land carbon cycle research

                                Soil-vegetation-     Forest growth     Terrestrial      Dynamic global        Biogeochemical
                                atmosphere transfer  model             biogeochemistry  vegetation model      land surface
                                scheme (SVAT)                          model            (DGVM)                model (LSM)
Example                         SiB2                 3PG, 4C, SORTIE   CENTURY, TECO    IBIS, LPJ-GUESS, ED2  CABLE, ORCHIDEE, CLM4.5
Energy cycling/balance          Y                    -                 sometimes        sometimes             Y
Hydrological cycling/balance    Y                    Y                 Y                Y                     Y
Canopy physiology/CO2 exchange  Y                    Y                 Y                Y                     Y
Plant C dynamics                -                    Y                 Y                Y                     Y
Belowground C dynamics          -                    sometimes         Y                Y                     Y
Nutrient (N, P) dynamics        -                    sometimes         Y                sometimes             sometimes
Plant functional types (PFTs)   static               static/dynamic    static           dynamic               usually static
Stand dynamics                  -                    sometimes         -                sometimes             -
Typical application             local/global         local             local            regional/global       regional/global


MODELING WORKFLOW


When using models as part of a robust research 
study design, there are a number of steps to go 
through. Depending on the study and the prior 
work we are building on, by ourselves or others, 
we may leave out some steps, change the sequence, 
or iterate more than once over a given set of steps. 
However, most modelers would agree, in princi- 
ple, that a study should feature the following steps, 
and that these should be systematically described 
in publications discussing the work, allowing oth- 
ers to fully understand and reproduce it. 


Specify the Question or Hypothesis and Identify 
How Modeling Can Help 


It is rather elementary that any scientific study 
begins with a question, and linked to this, a test- 
able hypothesis or hypotheses. If the hypothesis 
can be expressed or captured in the form of a 
model, and there are data available to compare and 
contrast to the model results, modeling may be 
valid to consider as part of the study design. 
Choose a Model


This step could also be labelled ‘develop a model’
or ‘adapt a model’. Textbooks on modeling often 
advocate for building the model from first princi- 
ples, starting from a conceptual diagram that cap- 
tures the hypotheses of the study, to be reflected 
in the model. In an ideal world, this ensures the 
chosen model is targeted specifically to the question
of the study, no more and no less. This principle
is captured by the saying, popular among modelers,
“the model should be as simple as possible, but 
no simpler”. 

In practice, the carbon cycle literature contains 
only rare examples of studies using completely 
novel model structures and codes. Many current 
model frameworks have evolved over decades, 
involving many person-years of development, cod- 
ing, evaluation, and application. Studies employing 
the model along the way turn up issues (scientific, 
as well as programming errors) focusing atten- 
tion on needed revisions and improvements, but 
also increase confidence in the model where they 
demonstrate that it is able to reproduce patterns or 
relationships seen in observations, or in indepen- 
dent studies using alternative approaches. A typical 
ecosystem model is a sophisticated software appli- 
cation comprising hundreds to thousands of lines 
of computer code. Even though it could be argued 
that many models have grown overly complicated, 
to at least some degree this reflects the complexity 
of real ecosystems, the range of biotic and abiotic 
factors and interactions that govern their dynam- 
ics, and the steadily advancing research front in 
understanding and modeling natural systems. 

Thus, in carbon cycle science, it is seldom effi- 
cient or realistic for an investigator to develop a 
new model completely from scratch, even if we 
accept that this would be the ideal. Rather, it is a 
question of choosing a model framework that fits 
the research question and study system in terms of 
criteria such as: model versus system complexity, 
scaling assumptions (temporal/spatial), available 
evidence of model skill (e.g., past published stud- 
ies on similar systems or questions), configurabil- 
ity of parameters, input files, and source code, and 
availability of code and documentation. 

A common pitfall is to select a model that has 
convenient technical features but is poorly matched 
to the target study in terms of spatial and tempo- 
ral scale. There are two essential aspects to this. 
One is that the same process can exhibit different
sensitivity depending on the scale of observation.
For example, due to structural, functional and
compositional heterogeneity, photosynthesis measured
at the leaf scale in a forest will show a different
relationship to drivers such as temperature,
insolation or CO₂ concentration compared to the
canopy, stand or landscape scale. Decomposition
of soil organic matter will show a different apparent
sensitivity to temperature (Q₁₀) in a chamber
measurement over an hour, compared to an incubation
experiment over a year, or a decades-long
soil inventory dataset. This means that the structure 
and parameters of the most suitable model to study 
variations in photosynthesis or decomposition 
will differ depending on the temporal and spatial 
context of the system under study and the research 
question in focus. Second, different processes and 
entities of the ecosystem are important in con- 
trolling variations at different scales. For example, 
seasonal cycles of leaf production and shedding 
(phenology) are important for the productivity of 
nemoral forests at annual and longer time scales, 
but leaf area index could be prescribed (specified 
as a fixed value) when modeling the gas exchange 
of a nemoral tree over a diurnal cycle. Changes in 
the sizes of ‘slow’ and ‘passive’ soil organic matter 
(SOM) pools tend to dominate soil carbon dynam- 
ics on scales of centuries and millennia, but have 
no impact on variation in CO₂ flux in short-term
chamber measurements. This implies that a single- 
pool SOM model may be suitable for the analysis 
of data from soil chambers, whereas a model used 
to analyze changes in global soil carbon under 
IPCC future climate projections needs to distin- 
guish multiple SOM pools of different lability (this 
is further discussed in Chapter 23). Ensuring that 
the chosen model accounts for the processes and 
entities relevant to the scale and question of the 
study is a particularly important consideration in 
carbon cycle studies. 


Verify that the Model Works 


Modelers distinguish between ‘verification’ and
‘validation’. The former seeks confirmation that
the model implementation behaves as expected in
a purely technical sense. When we run a numerical
model we are ‘solving’ the model given the input
data. Verification entails ensuring that the output
data from the model match the solution of the
set of difference (or balance) equations the model
(typically) encodes (see Chapter 3). In practice we
seldom have the true (analytical) solution of these
equations available to compare with the output of
the model, but, based on our scientific knowledge
of the system we are simulating, we can usually tell
whether the output is reasonable or ‘sane’ given
the forcing. For example, carbon pool sizes should
not normally go negative, while fluxes such as GPP,
autotrophic and heterotrophic respiration, or CH₄
emission, should be within a certain range.
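
Such sanity checks are easy to automate. The following is a minimal sketch in Python, not taken from any particular model: the function name `check_sanity`, the pool and flux names, and the plausible ranges are all hypothetical, and realistic bounds depend on the ecosystem and the units used.

```python
def check_sanity(pools, fluxes, flux_bounds):
    """Return a list of problems found in the output of one model time step.

    pools       -- dict of pool name -> size (e.g., gC m-2)
    fluxes      -- dict of flux name -> value (e.g., gC m-2 day-1)
    flux_bounds -- dict of flux name -> (lower, upper) plausible range
    """
    problems = []
    # Carbon pools should not go negative
    for name, size in pools.items():
        if size < 0:
            problems.append(f"pool '{name}' is negative: {size}")
    # Fluxes should stay within a plausible range
    for name, value in fluxes.items():
        lower, upper = flux_bounds[name]
        if not lower <= value <= upper:
            problems.append(f"flux '{name}' = {value} outside [{lower}, {upper}]")
    return problems


# Made-up values for one daily time step (illustration only)
pools = {"foliage": 250.0, "soil": -3.0}
fluxes = {"GPP": 6.5, "Rh": 2.1}
bounds = {"GPP": (0.0, 20.0), "Rh": (0.0, 15.0)}
report = check_sanity(pools, fluxes, bounds)
for line in report:
    print(line)
```

With these made-up values, the only problem reported is the negative soil pool. A check of this kind does not show that the model is right, only that its output is not obviously wrong.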


Calibrate the Model 


Some, though not all, models can be calibrated
against different types of observational or mea- 
surement data. For a system dynamic model, 
calibration may involve tuning parameters of indi- 
vidual equations or processes based on measure- 
ments or estimates of those processes. For example, 
key parameters of a biochemical photosynthesis 
model may be calibrated based on gas exchange 
measurements of leaves, yielding a so-called A-cᵢ
curve. When you choose an existing model frame- 
work for your study, this kind of process-level cali- 
bration will likely already have been done, though 
you may have good reason to perform your own 
calibration if you have relevant measurements 
from your system. 

A greater challenge is performing calibration 
on the overall output of the model, where this 
emerges from interactions between different pro- 
cesses, drivers and the evolving system state. For 
example, net biome production (NBP) in a global 
carbon cycle model may be the emergent balance 
between uptake (photosynthesis/GPP) and mul- 
tiple release fluxes (autotrophic and heterotrophic 
respiration, emissions from wildfires) integrated 
across the land ecosystems of the world. The 
release fluxes in particular depend not only on cur- 
rent environmental drivers, but on the cumulative 
sizes of source pools such as SOM pools of differ- 
ent quality/lability in different climates. Because 
of the long response lags, or spin-up effect, in such 
a model, and due to spatial heterogeneity, errors 
in individual processes can rapidly accumulate 
and combine to generate large bias in NBP even 
if individual processes are well calibrated to mea- 
surement data, where such are available. 

This book offers a number of the best available 
solutions to the considerable challenge of calibrat- 
ing the emergent output of an ecosystem model, 
in all its complexity. In essence, these involve
identifying the parameters across different process
formulations of the model to which the model is 
most sensitive (with respect to a particular output 
variable, such as NBP), specifying the potential 
real-world range of each parameter, and search- 
ing the hyperdimensional space of this parameter 
set (within the realistic range) to find a combi- 
nation of values that yields the best fit between 
the model output, given those parameter values, 
and observational data on the same variable. This 
approach, termed data assimilation or model-data 
fusion, is introduced in Chapter 21, and further 
elaborated and exemplified in subsequent chapters 
of Units 6-8. 
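
The idea of searching a bounded parameter range for the best fit can be illustrated with a toy example. The sketch below is only an assumption-laden illustration, not the data assimilation methods of later chapters: it uses a one-pool model of the kind introduced in Chapter 3, a single parameter (the turnover rate k), synthetic 'observations' generated by the model itself, and a plain grid search. The function names and the assumed range for k (0.05 to 0.50 per year) are hypothetical.

```python
def simulate(x0, inflow, k, n_years):
    """One-pool model stepped annually: x(t+1) = x(t) + inflow - k * x(t)."""
    series = [x0]
    for _ in range(n_years):
        series.append(series[-1] + inflow - k * series[-1])
    return series


def sum_squared_error(model, obs):
    """Misfit between a modeled and an observed time series."""
    return sum((m - o) ** 2 for m, o in zip(model, obs))


# Synthetic "observations", generated with a known turnover rate (k = 0.2)
observed = simulate(30000.0, 2000.0, 0.2, 10)

# Candidate values spanning the assumed realistic range of k
candidates = [round(0.05 + 0.01 * i, 2) for i in range(46)]

# Keep the candidate whose simulation is closest to the observations
best_k = min(
    candidates,
    key=lambda k: sum_squared_error(simulate(30000.0, 2000.0, k, 10), observed),
)
print("best-fitting k:", best_k)
```

Because the observations here were generated without error, the search recovers the known value of k. With real data, many parameters and observation uncertainty, an exhaustive grid quickly becomes impractical, which is one reason for the more efficient search strategies introduced later in the book.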


Validate the Model 


In the introduction to this chapter it was noted that 
all models, by definition, are a simplified depiction 
of the real thing they represent. This is not a failing 
or shortcoming but the very essence of a model. 
However, being simpler also implies that even a 
good model can never be expected to replicate the 
behavior of a real system exactly, given that some 
processes and interactions involved in the behavior 
of the real system are missing. Thus, model error 
is inherent in the very concept and purpose of a 
model. Box (1979) stated: “all models are wrong, 
but some are useful”. 

The simple observation that a model can never 
be perfectly ‘true’ has led some writers (e.g., 
Oreskes (1994)) to suggest a model cannot be 
validated, in the sense that validation implies con- 
firmation of truth. This argument is largely seman- 
tic however, and in practice we are of course very 
interested in knowing the extent to which the 
model behaves in a similar way to what we know, 
from observation and theory, of the real system 
under study. Canham (2003) stated: “The process 
of evaluating model structure is clearly critical 
enough to warrant a specific term, and ‘validation’ 
appears to be the best candidate”. 

A robust validation strategy focuses not only on 
the emergent output of the model (for example, 
NBP), but on the behavior of individual process 
representations and the assumptions they entail. 
This is because compensating errors and biases 
in different processes, or for example across the 
grid of a spatially-distributed model, can coinci- 
dentally produce the ‘right result for the wrong 
reason’, when focusing only on an emergent or
integrated output such as NBP. This issue is formally
termed equifinality, the potential to obtain
similar results with different model structures
and parameterizations (Luo et al. 2009). The
Free-Air CO₂ Enrichment Model Data Synthesis
(FACE-MDS) initiative has pioneered best practice
in validation of complex ecosystem models
using datasets from field experiments through an 
‘assumption-centered’ approach (Medlyn et al. 
2015). The alternative, perhaps more established 
and less labor-intensive, approach of benchmarking 
uses data on multiple output variables to simul- 
taneously query the skill of the model in differ- 
ent dimensions, reflecting different processes and 
feedbacks (Chapter 19). 
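
The ‘right result for the wrong reason’ can be made concrete with a few lines of Python. In this sketch, all flux values are made up for illustration, and the names `model_a` and `model_b` stand for two hypothetical models. Both reproduce the observed NBP exactly, but only a comparison of the component fluxes reveals that one of them does so through large compensating errors.

```python
def nbp(fluxes):
    """Net biome production: uptake minus the release fluxes."""
    return fluxes["GPP"] - fluxes["Ra"] - fluxes["Rh"] - fluxes["fire"]


def relative_errors(model, observed):
    """Relative error of each component flux, keyed by flux name."""
    return {name: abs(model[name] - observed[name]) / observed[name]
            for name in observed}


# Made-up annual fluxes in arbitrary carbon units (illustration only)
observed = {"GPP": 120.0, "Ra": 60.0, "Rh": 55.0, "fire": 2.0}
model_a = {"GPP": 121.0, "Ra": 60.5, "Rh": 55.5, "fire": 2.0}  # small errors
model_b = {"GPP": 150.0, "Ra": 75.0, "Rh": 70.0, "fire": 2.0}  # compensating errors

for label, model in (("A", model_a), ("B", model_b)):
    worst = max(relative_errors(model, observed).values())
    print(label, "NBP =", nbp(model), "worst component error =", round(worst, 3))
```

Judged on NBP alone, the two hypothetical models are indistinguishable. Querying several output variables at once, as benchmarking does, is what exposes the difference.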


Design the Model Experiment 


The model experiment, sometimes also termed 
the simulation protocol, tailors the configuration, 
forcing and output data from the model simula- 
tion so as to shed light on the research question 
of the study. As discussed in the introduction to 
this chapter, the model typically stands for, and 
incorporates, our hypothesis or hypotheses about 
the system and its responses to influences such as 
a shift in climate or a management intervention. 
Through the model experiment, we wish to probe 
whether, or under what assumptions and condi- 
tions, the model reproduces salient observations 
of the system, thereby testing our hypothesis as 
encapsulated by the model, and allowing us to 
draw inferences about the behavior of the system. 

There are several elements to designing a robust 
model experiment and some of these are challeng- 
ing in the case of carbon cycle studies. One chal- 
lenge concerns the initialization or spin-up of the 
model state. In general, the spin-up strives to attain 
a steady state for ecosystem carbon pools ahead 
of the main part of the model simulation, avoid- 
ing drift in the model output. Some challenges and 
solutions are presented in Chapter 13 and exem- 
plified in Chapter 14. An alternative to the steady 
state assumption is discussed in Chapter 27. 
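
As a minimal sketch of what a spin-up does, the Python function below repeatedly steps a one-pool model (of the kind introduced in Chapter 3) under constant forcing until the pool size stops drifting. The function name, the convergence tolerance and the parameter values are assumptions made for illustration; spin-up in a real multi-pool model is considerably more demanding (Chapters 13 and 14).

```python
def spin_up(x0, inflow, k, tolerance=1e-6, max_years=100000):
    """Step x(t+1) = x(t) + inflow - k * x(t) until the annual drift is negligible.

    Returns the pool size at (near) steady state and the number of years needed.
    """
    x = x0
    for year in range(1, max_years + 1):
        x_next = x + inflow - k * x
        drift = abs(x_next - x)
        # Stop when the annual change is tiny relative to the pool size
        if drift <= tolerance * max(abs(x_next), 1.0):
            return x_next, year
        x = x_next
    raise RuntimeError("spin-up did not reach steady state within max_years")


# Start from an empty pool with constant input (illustrative values)
steady_pool, years_needed = spin_up(0.0, 2000.0, 0.2)
print("pool size after spin-up:", round(steady_pool, 1), "after", years_needed, "years")
```

For this simple model the steady state can also be found directly by setting input equal to output, which gives a pool size of inflow / k. The iterative approach is shown because it is the one that carries over to models where no such shortcut is available. Pools with slow turnover (small k) need many more years to converge.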


The book contains many examples of model 
experiments tailored to different research ques- 
tions and problems. Some examples were listed in 
Box 2.1. 


SUMMARY 


We have seen that the model, whether conceptual 
or formally specified, is an inherent component of 
research methodology and the scientific method, 
integrating the knowledge we have with hypoth- 
eses we may wish to pose about the system under 
study. Several different types of models are applied 
within carbon cycle research, but this book 
focuses on numerical simulation models encap- 
sulating networks of plant and soil compartments 
and the processes that govern the flows of carbon 
and other elements across the network, influenced 
by environmental drivers. The workflow of six 
steps, further elaborated in other parts of the book, 
serves as a guide to how we may integrate model- 
ing in the design of a robust scientific study. 


SUGGESTED READING 


Canham, C.D., Cole, J.J. & Lauenroth, W.K. 2003.
Models in Ecosystem Science. Princeton University Press, 
Princeton, NJ. 


QUIZZES 


1. “The model should be as simple as possible, but 
no simpler” — Discuss! 


2. List four ways in which models and observations 
can be combined within ecosystem studies. 
What is the role of the model versus the obser- 
vational data in each case? 


3. The carbon cycle of an ecosystem can be 
depicted as a compartmental dynamic system. 
List four properties of such a system. 


4. How is the spatial and temporal scale of a study 
relevant in selecting a suitable model? 


5. Define these terms: calibration, verification, validation,
benchmarking.


Great oaks from little acorns grow. This chapter
offers basic concepts and tools for building land 
carbon models. If we build on them step-by-step, 
we can end up understanding complex land car- 
bon models. The goal of this chapter is to under- 
stand carbon flow diagrams and balance equations, 
to be able to derive generic carbon balance equa- 
tions from carbon flow diagrams, and vice versa. 


CARBON FLOW DIAGRAM 


Flow diagrams are not new to us. We use flow
diagrams in different forms and complexities, and for
different purposes. For example, we could draw
a simple flow diagram to help us diagnose what
might be the issue and what to do when a computer
stops working (Figure 3.1a). Or you could
find a complex flow diagram like the schematic
of carbon, nitrogen and phosphorus cycles considered
in ORCHIDEE-CNP (Figure 3.1b), which
illustrates sophisticated interactions among carbon,
nitrogen and phosphorus dynamics. A classic
carbon flow diagram from IPCC reports is shown
in Figure 3.1c. When it comes to carbon, we are
generally interested in tracking the amount of
carbon in space and time. In Figure 3.1c, you see
different compartments or pools with different
amounts of carbon. You also see flows of carbon
from one compartment to another.

We use land carbon models to track temporal-spatial
dynamics of carbon, to answer questions like:
Where is carbon stored? How long does carbon stay 
in one place? When, how, where, and how much 
carbon is transferred? Depending on the scientific 
questions, we may also use land carbon models to 
track water, nutrients, and energy as they are parts 
of the critical environment that drives and interacts 
with carbon dynamics. Earth is a system comprised 
of numerous interacting processes. Here, we focus 
mainly on carbon dynamics, but a lot of principles 
and methods introduced here are also applicable to 
water, nutrients, and energy. A flow diagram is very 
helpful in conceptualizing our issues, clarifying study 
boundaries, and disentangling complex interactions. 

We use the terms stock (or storage, pool) and 
flow (or flux) very frequently in carbon studies. 
Suppose that you have a bank account. The total 
amount of money in your bank account is the 
stock. Every month, you deposit a certain amount 
of money, for example, from your salary. You also 
withdraw some money each month for your every- 
day expenses. This deposit and withdrawal are the 
flows into and out of your account. The same con- 
cept applies to carbon. We could take total carbon 
in the biosphere, for example, as our stock, and 
we track carbon flows into (deposits) and out of 
(withdrawals) the biosphere to understand the 
dynamics of the carbon stock in the biosphere.

Figure 3.1. Examples of flow diagrams. Panel b is adapted from Sun et al., 2021 and Panel c from Ciais et al., 2014.


We need to know the boundaries of our system
to be clear on what our stocks are, what
our flows are, and in which directions they are moving.
Take terrestrial carbon flows as an example
(Figure 3.2). When we take terrestrial plants as
our study system, carbon in plants is the stock.
Carbon that goes through photosynthesis, a process
that turns carbon dioxide into sugars (carbohydrates,
or organic carbon), is our incoming flow.
Respiration involves using the sugars produced
during photosynthesis plus oxygen to produce
energy for plant growth. It is one of our outgoing
carbon flows. Litterfall, in which the plant
sheds leaves, fine roots and branches as part of its
phenological (seasonal) cycles, or under unfavorable
conditions, is the second outgoing flow. If we
take soil as our study system, soil carbon is the
stock. Litterfall is the incoming carbon flow. Soil
respiration is one outgoing carbon flow and carbon
lost during runoff is another.

Figure 3.2. An idealized illustration of terrestrial processes of carbon flow.


In the field, soil respiration generally comprises 
two main components: the autotrophic respira- 
tion that takes place in plant roots and the het- 
erotrophic respiration due to the breakdown of 
soil organic matter by microbes. If, instead, we 
take plant and soil together as our system, the car- 
bon stock is the total carbon in plant and soil. The 
incoming carbon flux is the carbon flow through 
photosynthesis and the outgoing fluxes are plant 
respiration, soil respiration and carbon lost during 
runoff. Litterfall is no longer an incoming or out- 
going flow, but an internal flow linking the plant 
and soil carbon stocks. 


CARBON BALANCE EQUATIONS 


The principle for tracking carbon dynamics is the
law of conservation of mass. Matter can neither be
created nor destroyed, but chemical structure and
physical form can change.

If we know the initial carbon pool size, and the
values of carbon fluxes going into and out of the
pool, we can track the dynamics of the carbon pool
through time. This follows the law of conservation
of mass. In terms of carbon, the change in its pool
size is always equal to the mass of total incoming
minus total outgoing flows. Suppose we have a
carbon pool (call it x here) that weighs 30,000
grams of carbon (gC) initially, i.e., at time t = 0 (Figure 3.3).
You could think about it as if you have $30,000
in your bank account at the start of the year. Let's
say the incoming flux is 2,000 gC year⁻¹. We call it
the input rate I. In our monetary analogy, it corresponds
to your annual salary. Let's say your income
is relatively stable, so I does not change with time.
For the outgoing flow, the donor pool-dominant
transfer is a common example from the land carbon
cycle. For example, if x represents the carbon contained
in the biomass of a plant, respiration represents
an outgoing flow that depends on the size of
this pool: if we have a bigger plant, we would end
up with more respiration. Back to our analogy: the
higher your income, the more tax you pay. Suppose
20% of carbon in the plant pool is used up through
respiration every year. This can be expressed as a
turnover rate: a proportion of the pool that is used
up and converted to an outgoing flux per unit time.
Let's give it the symbol k. Here, k = 0.2 year⁻¹. So
how much carbon will we have at the end of the
year or at the beginning of the next year?

Figure 3.3. A conceptual one-pool carbon model.

Here our time step is 1 year. When t = 0, we
have x = 30,000 gC (initial pool size), I = 2,000 gC
year⁻¹, and k = 0.2 year⁻¹.

When t = 1, our pool size x(t = 1) = x(t = 0)
+ input − output = 30,000 + 2,000 × 1 −
0.2 × 30,000 × 1 = 26,000 gC.

What if we want to know x when t = 2? It is the
same logic: x(t = 2) = x(t = 1) + input − output
= 26,000 + 2,000 × 1 − 0.2 × 26,000 × 1 = 22,800 gC.
Similarly, we could derive the carbon stock in the
third, fourth, fifth year, and so on. The mathematical
equation for x can be summarized as:

$$x(t+\Delta t) = x(t) + I \times \Delta t - k \times x(t) \times \Delta t \tag{3.1}$$

which says x at the next time step equals x at the
current time step, plus the change of x. The time
step (1 year here) is denoted by Δt. The change of x is
expressed as the input flow minus the output flow.
This is the basis for the Euler method, a numerical
method we sometimes apply to solve the differential
equation to be introduced below.

Equation 3.1 can be rewritten as,

$$\frac{x(t+\Delta t) - x(t)}{\Delta t} = I - k\,x(t) \tag{3.2}$$

When Δt → 0, Equation 3.2 is the same as,

$$\frac{dx}{dt} = I - kx \tag{3.3}$$

Figure 3.4. A conceptual two-pool carbon model.


Equation 3.2 is the difference version of the 
carbon balance equation, and Equation 3.3 is the 
differential version. Both equations embody the 
law of conservation of mass. Another way of put- 
ting this is, they uphold mass balance: the rate 
of change of a carbon pool equals the input rate 
minus the output rate. 
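
The year-by-year bookkeeping of the one-pool example can be written as a short Python loop. This is a minimal sketch of the Euler update in Equation 3.1; the function name `euler_one_pool` is our own choice for illustration, and the parameter values are those of the worked example.

```python
def euler_one_pool(x0, inflow, k, dt, n_steps):
    """Apply the update x <- x + inflow * dt - k * x * dt for n_steps steps.

    Returns the list of pool sizes, starting with the initial size x0.
    """
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        x = x + inflow * dt - k * x * dt  # input minus output over one step
        trajectory.append(x)
    return trajectory


# Worked example: x0 = 30,000 gC, I = 2,000 gC per year, k = 0.2 per year
annual = euler_one_pool(30000.0, 2000.0, 0.2, 1.0, 2)
print(annual)  # pool size at t = 0, 1 and 2 years

# The same two years with a ten times smaller time step
fine = euler_one_pool(30000.0, 2000.0, 0.2, 0.1, 20)
print(fine[-1])
```

With a time step of 1 year, the loop reproduces the values calculated by hand above (26,000 gC and 22,800 gC). With a smaller time step, the result moves closer to the solution of the differential equation (Equation 3.3), at the cost of more computation.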

If we have two carbon pools in our system,
sometimes the outgoing flux of one pool
might become the incoming flux of the other
(Figure 3.4). For example, in our plant-soil system
in Figure 3.2, the litterfall is an outgoing flow of
the plant pool and an incoming flow into the soil
pool. For the plant pool (x₁), the carbon balance equation
can be written the same as in the above one-pool
case. The carbon balance equation of the second
(soil) pool (x₂) also expresses the mass balance. Here
the incoming rate is a fraction, say f₂₁, of the outgoing
flux from pool x₁. The outgoing flux from
pool x₁ is a constant annual proportion of the plant
pool: k₁x₁. This outgoing flux comprises the fraction
that goes into soil (e.g., through litterfall) and
the fraction that does not enter soil (e.g., plant
respiration). Combining these terms, we have the
incoming flux of pool x₂ as f₂₁k₁x₁. The outgoing
flux from x₂ is also pool-size dependent: k₂x₂. The
carbon balance equations for this two-pool system
are given by Equations 3.4 and 3.5:

$$\frac{dx_1}{dt} = I - k_1 x_1 \tag{3.4}$$

$$\frac{dx_2}{dt} = f_{21} k_1 x_1 - k_2 x_2 \tag{3.5}$$

Figure 3.5. Flow diagrams of two carbon models: (a) CENTURY; (b) TECO. (a: adapted from Parton et al., 1988; b: adapted from Xu et al., 2006).
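
The two-pool balance equations (Equations 3.4 and 3.5) can be stepped forward in time in the same way as the one-pool case. The sketch below is illustrative only: the function name and the values chosen for the initial soil pool, f₂₁ and k₂ are assumptions, not values given in the text.

```python
def step_two_pool(x1, x2, inflow, k1, k2, f21, dt):
    """Advance the plant pool (x1) and soil pool (x2) by one Euler time step."""
    dx1 = inflow - k1 * x1           # Equation 3.4
    dx2 = f21 * k1 * x1 - k2 * x2    # Equation 3.5
    return x1 + dx1 * dt, x2 + dx2 * dt


# Illustrative values: plant pool as in the one-pool example,
# soil pool, f21 and k2 are hypothetical
x1, x2 = 30000.0, 50000.0
inflow, k1, k2, f21 = 2000.0, 0.2, 0.05, 0.5

new_x1, new_x2 = step_two_pool(x1, x2, inflow, k1, k2, f21, dt=1.0)

# Carbon leaving the system: plant outflow not entering soil, plus soil outflow
released = (1.0 - f21) * k1 * x1 + k2 * x2
print("plant:", new_x1, "soil:", new_x2)
print("change in total stock:", (new_x1 + new_x2) - (x1 + x2))
print("input minus release:  ", inflow - released)
```

The last two printed numbers agree: the transfer from plant to soil cancels out when the two pools are added together, so the total stock changes only through the external input and the fluxes that leave the system. This is the same point made earlier about litterfall becoming an internal flow when plant and soil are treated as one system.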


We may extend this same logic to models with 
any number of pools. For each pool, we need to 
know what are its incoming and what are the out- 
going fluxes. Figure 3.5 shows the carbon flow 
diagram for two well-known carbon models. The 
left one is the classic soil carbon model, CENTURY 
(simplified; Parton et al., 1988). The right is the
TECO model (Xu et al., 2006). Both models con- 
ceptualize our natural carbon cycles into different 
pools or ‘compartments’, represented by boxes 
in the diagram, and track the transfers of carbon 
among different pools, represented as arrows. 
The transfers could be parameterized as one con- 
stant turnover rate relative to the size of the donor 
(source) pool, or could be modeled by functions 
with multiple terms, parameters, and dependen- 
cies, capturing complex interactions with climate, 
soil, and other factors. 


FROM FLOW DIAGRAM TO CARBON BALANCE 
EQUATIONS 


We take the TECO model as an example to illustrate
how to derive a system of carbon balance equations
from the carbon flow diagram. In Figure 3.5,
we see that TECO has seven carbon pools: two plant
biomass carbon pools (foliage, woody biomass),
two litter carbon pools (metabolic, structural), and
three soil organic carbon pools (microbes, slow
soil organic matter (SOM) and passive SOM). We
use x₁ to x₇ to denote these pools. For each carbon
pool, we need to specify the fluxes that enter and
leave the pool. Carbon flux through photosynthesis
is the external carbon input into our system of
seven pools. We will denote this input rate by I.
Photosynthetic carbon input is allocated into foliage
and woody biomass, in relative proportions
denoted by the allocation coefficients β₁ and β₂.
The carbon balance equations for the foliage pool (x₁)
and the woody pool (x₂) are given by Equations 3.6 and 3.7.

$$\frac{dx_1}{dt} = I\beta_1 - k_1 x_1 \tag{3.6}$$

$$\frac{dx_2}{dt} = I\beta_2 - k_2 x_2 \tag{3.7}$$


The structural litter pool (x₄) receives carbon
input from pools x₁ and x₂. So the carbon balance
equation of x₄ has two input fluxes. One is
a fraction (f₄₁) of the outgoing carbon flux from
x₁ (k₁x₁). The other is a fraction of the outgoing
carbon flux from x₂. This incoming flux could be
written as f₄₂k₂x₂. The outgoing flux for pool x₄ is
dependent on the pool size and its turnover rate,
k₄. We write this flux as k₄x₄. Equation 3.8 is the
resulting carbon balance equation of the structural
litter pool, x₄.

$$\frac{dx_4}{dt} = f_{41} k_1 x_1 + f_{42} k_2 x_2 - k_4 x_4 \tag{3.8}$$


The same procedure applies for the other carbon
pools. We count how many arrows go into
a pool and how many go out. One arrow normally
corresponds to one flux. The boundary of
the system in this example encompasses the seven
organic matter pools. The atmosphere is not part of
the system and we do not track further the fluxes
that go into the atmosphere, i.e., respiration represented
by arrows labelled ‘CO₂’ in Figure 3.5. That
being said, how much carbon goes into the atmosphere
as carbon dioxide is an important topic as it
is closely linked to climate change. Carbon balance
equations for x₃, x₅, x₆ and x₇ are given below:

$$\frac{dx_3}{dt} = f_{31} k_1 x_1 - k_3 x_3 \tag{3.9}$$

$$\frac{dx_5}{dt} = f_{53} k_3 x_3 + f_{54} k_4 x_4 + f_{56} k_6 x_6 + f_{57} k_7 x_7 - k_5 x_5 \tag{3.10}$$

$$\frac{dx_6}{dt} = f_{64} k_4 x_4 + f_{65} k_5 x_5 - k_6 x_6 \tag{3.11}$$

$$\frac{dx_7}{dt} = f_{75} k_5 x_5 + f_{76} k_6 x_6 - k_7 x_7 \tag{3.12}$$


A further example of a multiple-pool carbon model is
the ORCHIDEE model. Figure 3.6 shows the litter
and soil organic carbon (SOC) component of
ORCHIDEE. As a rule of thumb, around 50% of
soil organic matter is SOC. Later versions of this
model separate SOC into different layers according
to soil depth. However, the version depicted
in Figure 3.6 tracks seven pools (four litter
pools and three SOC pools) with different layers
lumped together. Therefore, we have seven carbon
balance equations. The idea is the same as what we
have already worked through for the TECO model.
We track the arrows that go into and out of each
pool. For example, for the active SOC pool, x₅, we
have six arrows that go into this pool. Each arrow
represents an incoming flux from a different pool,
either litter or SOC. We will use f₅₁, for example, to
represent the fraction of the carbon flux that goes from
the first pool (aboveground metabolic litter) to the
fifth pool (active SOC).
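
The arrow-counting procedure can be sketched generically in Python: each arrow between pools is stored as a (receiver, donor) pair with its transfer fraction, and the balance equation of every pool is assembled from the arrows that point to it. The sketch below uses the seven-pool TECO layout as written in the equations of this section. The function name `balance_rhs` and all numerical values (pool sizes, turnover rates, transfer fractions, input and allocation) are made up for illustration.

```python
def balance_rhs(x, k, transfers, external_input):
    """Return dx/dt for every pool: external input + incoming arrows - outflow.

    x, k           -- dicts of pool index -> pool size / turnover rate
    transfers      -- dict of (receiver, donor) -> fraction of the donor outflow
    external_input -- dict of pool index -> input from outside the system
    """
    rhs = {}
    for pool in x:
        incoming = sum(frac * k[donor] * x[donor]
                       for (receiver, donor), frac in transfers.items()
                       if receiver == pool)
        rhs[pool] = external_input.get(pool, 0.0) + incoming - k[pool] * x[pool]
    return rhs


# One entry per arrow between pools; the fractions are hypothetical
TRANSFERS = {(3, 1): 0.4, (4, 1): 0.2, (4, 2): 0.6,
             (5, 3): 0.45, (5, 4): 0.3, (5, 6): 0.4, (5, 7): 0.45,
             (6, 4): 0.3, (6, 5): 0.3, (7, 5): 0.004, (7, 6): 0.03}

# Hypothetical pool sizes (gC m-2) and turnover rates (per year)
x = {1: 200.0, 2: 5000.0, 3: 50.0, 4: 300.0, 5: 100.0, 6: 2000.0, 7: 1000.0}
k = {1: 1.0, 2: 0.05, 3: 5.0, 4: 0.5, 5: 2.0, 6: 0.05, 7: 0.002}

# External input I split between foliage and wood by allocation coefficients
inflow, beta = 1000.0, {1: 0.6, 2: 0.4}
inputs = {pool: inflow * share for pool, share in beta.items()}

rhs = balance_rhs(x, k, TRANSFERS, inputs)
for pool in sorted(rhs):
    print(f"dx{pool}/dt = {rhs[pool]:.2f}")
```

Adding or removing an arrow in the diagram corresponds to adding or removing one entry in `TRANSFERS`; the function itself does not change. This separation of model structure from the bookkeeping is what the matrix form takes one step further.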

Figure 3.6. Flowchart and carbon balance equations of the ORCHIDEE model.

When we have a small number of carbon pools,
it is not difficult to write down the carbon balance
equations one-by-one. It would be very tedious to
write down all the equations if we have many pools,
for example, 100 for the version of ORCHIDEE that
simulates SOC dynamics at different soil depths.
This version, known as ORCHIDEE-MICT, tracks
seven types of organic carbon (four litter compartments
and three SOC compartments), similarly to
the version depicted in Figure 3.6. ORCHIDEE-MICT
has 32 soil layers (Huang et al., 2018a). For
each soil layer, the model has active, slow and passive
SOC pools. Such a vertical soil discretization
is especially helpful and more realistic in modeling
soil carbon dynamics in permafrost regions.
In total, the model has 32 × 3 + 4 = 100 pools.
Instead of writing down the carbon balance equations
one-by-one, we can put them together into
one matrix equation, as shown in Figure 3.7. The
matrix form of the carbon balance equations says
the same thing: the rate of change in carbon pool
size equals the input minus the output. Now it is
not that obvious to spot the input fluxes and
the output fluxes. They are folded into the matrices
and depend on the sign of the elements in
these matrices. The first item on the right side of
the matrix equation in Figure 3.7 summarizes the
external carbon input into the system. The second
block of matrices summarizes the turnovers and
transfers of fluxes among different organic carbon
pools. The third block of matrices captures vertical
processes between the adjacent soil layers, such as
bioturbation, diffusion, and advection.
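
To see how balance equations fold into matrices, consider the two-pool plant-soil system of Equations 3.4 and 3.5. The sketch below assumes one common way of writing the matrix form, dX/dt = B I + A K X, where B holds the allocation of the external input, K is a diagonal matrix of turnover rates, and A holds −1 on the diagonal (losses) and the transfer fractions off the diagonal. The names A, K and B, the numerical values, and the omission of the vertical transport term are assumptions of this illustration. Plain Python lists are used so that the example needs no additional libraries.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rhs_matrix_form(x, inflow, b, a, k_diag):
    """Evaluate dX/dt = B * I + A K X for all pools at once."""
    kx = [k * xi for k, xi in zip(k_diag, x)]   # K X: outgoing flux of each pool
    akx = mat_vec(a, kx)                        # A K X: losses and transfers
    return [bi * inflow + t for bi, t in zip(b, akx)]


# Two-pool example with hypothetical values
inflow, k1, k2, f21 = 2000.0, 0.2, 0.05, 0.5
x = [30000.0, 50000.0]       # plant pool, soil pool
b = [1.0, 0.0]               # all external input enters the plant pool
k_diag = [k1, k2]
a = [[-1.0, 0.0],            # row 1: loss from the plant pool
     [f21, -1.0]]            # row 2: gain from plant, loss from the soil pool

from_matrix = rhs_matrix_form(x, inflow, b, a, k_diag)

# The same rates written out one equation at a time (Equations 3.4 and 3.5)
one_by_one = [inflow - k1 * x[0], f21 * k1 * x[0] - k2 * x[1]]
print(from_matrix, one_by_one)

# Number of pools in the vertically resolved model described above
n_pools = 32 * 3 + 4
print("matrix dimension:", n_pools, "x", n_pools)
```

Both calculations give the same rates of change. The negative diagonal elements of A carry the output of each pool, and the positive off-diagonal elements carry the transfers that appear as input to another pool, which is why inputs and outputs are distinguished by the sign of the matrix elements.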

Variations in these matrices reflect structural
or parametric differences among models. For
example, the left side of Figure 3.8 shows the
flow diagram of the litter and SOM component of
the CLM4.5 model. CLM4.5 tracks coarse woody
debris, metabolic litter, cellulose litter, lignin litter,
and fast, slow, and passive SOM in ten soil layers
(Huang et al., 2018b). Unlike ORCHIDEE-MICT,
CLM4.5 also tracks the vertical distribution
of litter. The matrix equation here takes a similar
general form, but the elements in each matrix are different.
The dimensions of the CLM4.5 matrices are
70 × 70, corresponding to 70 organic matter pools,
while the ORCHIDEE-MICT matrices are 100 × 100.

Figure 3.7. Flowchart and matrix-form carbon balance equation of the ORCHIDEE-MICT model with vertical soil discretization.

Figure 3.8. Flowcharts of the litter and soil organic carbon components of CLM4.5 and ORCHIDEE-MICT.

We will explore matrix representations in more 
detail in the following chapter. If you have under- 
stood the concepts introduced here, then big con- 
gratulations. These complex land carbon models 
are built upon these basics step by step. 


SUGGESTED READING 


Luo YQ, Weng ES. (2011) Dynamic disequilibrium of 
the terrestrial carbon cycle under global change. 
Trends in Ecology & Evolution, 26, 96-104. 


QUIZZES 


1. What does a pool stand for? 
2. What does a flux mean? 


3. What is the principle of writing carbon balance 
equation? 


4. Does burning of the organic matter by fire violate
the carbon balance?


The exercises are designed to help you practice
writing carbon balance equations from a carbon 
flow diagram and vice versa. The goal is to lead you 
to understand the concepts of flow, pool, and mass 
balance. Please refer to Chapter 3 if you have dif- 
ficulties with the exercises. 


INTRODUCTION 


Land biogeochemical models simulate carbon
cycling through different pools in an ecosystem
according to carbon balance equations, regardless
of model structure. The balance equations are
based on the law of conservation of mass, which
tells us that the change of one carbon pool size
is always equal to the net difference between the
incoming and outgoing carbon fluxes. Different
pools in the ecosystem are connected via carbon
transfer between them. To track carbon flows
among different carbon pools, scientists usually
develop a carbon flow diagram that uses boxes to
indicate pools and arrows to indicate fluxes. Thus,
the carbon flow diagram is often called a box-arrow
diagram. The models that track carbon flows
among pools are often called pool-flux models.
The matrix form of land carbon cycle models is
built upon the carbon balance equations that are
connected via carbon transfer among pools. To
learn the matrix approach to land carbon cycle
modeling, it is essential to understand the carbon
flow diagram and carbon balance equation.

The first exercise of this unit is focused on writing
carbon balance equations from a carbon flow
diagram, whereas the second exercise is about
drawing the carbon flow diagram according to
carbon balance equations. Both drawing the carbon
flow diagram from the carbon balance equations
and writing the equations from the diagram
assist us in understanding land carbon dynamics
and developing matrix models.


EXERCISE 1: Writing carbon balance
equations for the CENTURY model


Figure 4.1 is the carbon flow diagram of the
CENTURY (simplified) model. Plant residue
is divided into structural and metabolic litter
according to its lignin and nitrogen content.
Structural and metabolic litters are decomposed
by microbes. The resulting microbial products
become the substrate for the formation of soil
organic matter in three soil pools (i.e., active,
slow, and passive). For the decomposition flux
of the structural litter, a fraction (A) is incorporated
into the slow soil organic carbon with a
turnover time of 25 years. We assume some of
the remaining fraction goes as microbial respiration,
and the other part contributes to the active
soil organic C. The flow diagram also tells us the
transfer fluxes among different soil carbon pools.

Figure 4.1. Flow diagram for the carbon flows in the CENTURY model (simplified). Reproduced from Parton et al. (1988).

Please write down the carbon balance equations
based on the carbon flow diagram (Figure
4.1). Note that some details in the diagram such
as L/N, 1-A, F(T) and CO₂ fluxes to the atmosphere
are not shown as in the original paper
by Parton et al. (1988). You could neglect information
such as (3y) in the box for structural C
and other similar boxes when you do this exercise.
You might read the paper by Parton et al.
(1988) if you are interested in this model.

To write the balance equations, you could 
use symbols as in Chapter 3 or develop your 
own symbol system. Fundamentally, you need 
to figure out the amount of carbon moving into 
one pool, the amount of carbon moving out of 
the pool, and changes in the amount of carbon 
in the pool (i.e., pool size change) in one unit 
of time. Here are a few tips. First, you could 
define state variables, such as x1 to represent 
the pool of structural C, x2 to represent the 
pool of metabolic C, etc. Then, the size change 
in the pool of structural C can be expressed by 
dx1/dt. You could similarly write changes in 
carbon amounts in other pools. Second, you 
need to define rate variables to represent rates 
of carbon moving out of pools, such as k1 for 
a rate of carbon moving out of the structural C 


pool, k2 for a rate of carbon moving out of the 
metabolic C pool, etc. With the rate of carbon 
moving out of a pool and a state variable of 
the pool, you might figure out how to calculate 
the amount of carbon leaving the pool. Then, 
you need to figure out the amount of carbon 
moving to a pool. The third step is to define the 
total amount of carbon input, let’s say using a 
symbol I, from plant residue. Fourth, you need 
to figure out a fraction of the incoming carbon 
that enters a particular pool. For example, carbon
input from plant residue partly goes to the
structural C pool and partly goes to the metabolic
C pool. You need to define a symbol, b1, as the
fraction of incoming carbon going to the structural
C pool and b2 as the fraction going to the
metabolic C pool. Then the amount of carbon moving
to the metabolic C pool is b2 × I. Similarly, you can figure out
the amount of carbon moving to the structural 
C pool. However, it becomes much more com- 
plicated to figure out the amount of carbon 
that moves to the active soil C pool. The pool 
receives carbon from all the other four pools. 
Thus, it has four terms, which are: f31 × k1x1
+ f32 × k2x2 + f34 × k4x4 + f35 × k5x5. Can
you figure out how to get each of the terms?
As a note, the symbol f31 describes the fraction
of the carbon that leaves pool 1 and enters pool
3. If you can go over these steps successfully,
you may obtain one carbon balance equation,

\[ \frac{dx_4}{dt} = f_{41} k_1 x_1 + f_{43} k_3 x_3 - k_4 x_4, \]

for the slow soil C pool. What are the carbon balance
equations for the other pools?
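One way to check a balance equation once it is written down is to evaluate it numerically. The sketch below does this for the slow soil C pool only. The helper name `pool_rate`, the pool sizes, the transfer fractions and the active-pool rate are hypothetical; only the 3-year and 25-year turnover times are taken from the exercise text.

```python
def pool_rate(inflow_terms, k_out, x_pool):
    """Rate of change of one pool: incoming fluxes minus its own loss k * x.

    inflow_terms is a list of (fraction, donor_rate, donor_pool_size) tuples.
    """
    inflow = sum(f * k * x for f, k, x in inflow_terms)
    return inflow - k_out * x_pool


# Hypothetical pool sizes (g C m^-2): 1 = structural, 3 = active, 4 = slow.
x = {1: 300.0, 3: 100.0, 4: 2000.0}
# Rates (yr^-1) as inverse turnover times; the value for pool 3 is assumed.
k = {1: 1 / 3.0, 3: 1 / 1.5, 4: 1 / 25.0}
# Illustrative transfer fractions, not CENTURY parameter values.
f41, f43 = 0.3, 0.2

# dx4/dt = f41*k1*x1 + f43*k3*x3 - k4*x4
dx4_dt = pool_rate([(f41, k[1], x[1]), (f43, k[3], x[3])], k[4], x[4])
print(round(dx4_dt, 3))
```

The same helper can be reused for the other pools by changing the list of inflow terms, which mirrors the step-by-step procedure described in the tips.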
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Exercise 2: Drawing a carbon flow 
diagram for the ReSOM model based 
on its carbon balance equations


This exercise aims to develop your ability
to draw a carbon flow diagram from carbon
balance equations, using the ReSOM model.
ReSOM is a REaction-network-based model of
Soil Organic Matter and microbes (Tang and
Riley, 2013; Tang and Riley, 2015). The model
simulates depolymerization of polymers,
mineralization and sorption of monomers by
microbes via extracellular enzymes and microbial
assimilation, microbial death, and necromass
decomposition.

Key equations that govern the exchange of
carbon among pools over time in the ReSOM
model are listed in Table 4.1,
together with definitions of symbols and
parameters. All pools are in units of carbon
mass per soil volume (g C m⁻³).

TABLE 4.1
Governing equations of ReSOM, pools and parameters and their definitions

Pool  Description                    Differential equation
S     polymeric organic carbon       \( dS/dt = I_S - F_S + \gamma_B B + f_E \gamma_E E \)  (4.1)
D     monomeric organic carbon       \( dD/dt = I_D + F_S - F_D + \gamma_B X + (1 - f_E) \gamma_E E \)  (4.2)
X     reserve microbial biomass      \( dX/dt = Y_X F_D - (\kappa - g + \gamma_B) X \)  (4.3)
B     structural microbial biomass   \( dB/dt = m X - (\gamma_B + p_E) B \)  (4.4)
E     extracellular enzymes          \( dE/dt = p_E B - \gamma_E E \)  (4.5)

where:

\(I_S\): polymeric input flux (g C m⁻³ d⁻¹)
\(I_D\): monomeric input flux (g C m⁻³ d⁻¹)
\(F_S\): polymeric depolymerization flux (g C m⁻³ d⁻¹)
\(F_D\): monomeric uptake flux (g C m⁻³ d⁻¹)
\(Y_X\): yield coefficient for reserve biomass (unitless)
\(f_E\): fraction of decayed extracellular enzymes contributing to the polymer pool (unitless)
\(\gamma_B\): microbial mortality rate (d⁻¹)
\(\gamma_E\): enzyme turnover rate (d⁻¹)
\(\kappa\): metabolic turnover rate (d⁻¹)
\(g\): growth rate (d⁻¹)
\(p_E\): enzyme production rate (d⁻¹)
\(m\): structural microbial biomass formation rate (d⁻¹)

Modified from Tang and Riley (2015) by Rose Z. Abramoff.
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To draw a carbon flow diagram of the ReSOM 
model from its carbon balance equations, you
need to figure out the following items: (1) the
number of pools and what they are; (2) carbon
transfer between any two pools; (3) carbon
input from outside of the system; and (4) carbon
loss from the system. Here is a hint to figure out
carbon transfer between two pools from a carbon
balance equation: a positive sign before a flux (e.g.,
F_S in equation 4.2) indicates carbon entering the
pool whereas a negative sign before a flux (e.g., F_D
in equation 4.2) indicates carbon leaving the pool.
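The sign rule in the hint can be turned into a small bookkeeping routine: a flux that appears with a negative sign in one equation and a positive sign in another becomes an arrow from the first pool to the second. The sketch below is illustrative only. It encodes just two fluxes, the function name `arrows` is hypothetical, and the entry for pool X assumes that the uptake flux F_D enters the reserve pool (the yield coefficient is ignored here).

```python
# Signed flux terms per pool equation: -1 means leaving, +1 means entering.
# Only F_S (depolymerization) and F_D (monomer uptake) are encoded; the
# entry for X is an assumption about where the uptake flux goes.
equations = {
    "S": {"F_S": -1},
    "D": {"F_S": +1, "F_D": -1},
    "X": {"F_D": +1},
}


def arrows(equations):
    """Pair the pool a flux leaves (negative sign) with the pool it enters."""
    sources, sinks = {}, {}
    for pool, terms in equations.items():
        for flux, sign in terms.items():
            if sign > 0:
                sinks[flux] = pool
            else:
                sources[flux] = pool
    return sorted((sources[f], sinks[f], f) for f in sources if f in sinks)


edges = arrows(equations)
for src, dst, flux in edges:
    print(f"{src} -> {dst} via {flux}")
```

Fluxes that appear with only one sign (for example an external input) have no partner pool and therefore correspond to arrows entering or leaving the system.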
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The goal of this chapter is to understand how the 
carbon balance of a system could be represented 
in matrix form, to be able to write a matrix car- 
bon model and, for advanced participants, to think 
about potential applications of the matrix model 
for your own research. 


WHAT IS THE MATRIX VERSION OF THE 
CARBON BALANCE EQUATION? 


We briefly introduced the matrix representations 
of the carbon balance of a multipool system in 
Chapter 3. Instead of writing down the carbon 
balance equations one-by-one for each pool, we 
can integrate all carbon balance equations within 
one matrix equation. This matrix equation cap- 
tures the carbon balance for each carbon pool and 
its linkages to other carbon pools of the system 
via the flows (fluxes) between them. One thing 
to keep in mind is that the matrix version of the 
carbon model, though it uses a different notation, 
is mathematically identical to the original carbon 
model that has one carbon balance equation for 
each carbon pool. Benefits of the matrix version of 
carbon models include simplicity in model struc- 
ture, high modularity in coding, clarity in diag- 
nostics, and computational efficiency in spin-up. 
These properties are discussed in detail in the fol- 
lowing Chapters: 6, 9, 14, 17, and 18. 
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The matrix version of many carbon models 
could be written as:

\[ \frac{dX}{dt} = B I + A K X \tag{5.1} \]

Each of these matrices could be time-dependent.
For simplicity, we neglect the time
dependence for now. The left-hand side of equation
5.1 depicts the rates of change of carbon pool
sizes in our system. If we have, for example, seven
pools, X here is a column vector with seven rows,
one for each pool, which are the state variables of
the system. The right-hand side of equation 5.1
captures the incoming and outgoing terms, i.e.,
the fluxes entering or leaving the pools of the system.
B is the allocation matrix, which functions to
partition the external carbon input I into different
carbon pools. In the plant-soil system of Chapter 3
(Figure 3.2), you could think of I as the photosynthetic
carbon input, and B captures the fraction of
plant-assimilated carbon that is allocated into different
plant organs. K is the turnover rate matrix.
A is the n × n dimensional transfer matrix, where
n is the number of carbon pools. A captures the
fraction of outgoing carbon flow from one pool
that goes into a second pool. From the location
and sign of elements in A, we could reconstruct
the network of carbon transfers among different
carbon pools.
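As a minimal sketch of equation 5.1, the code below evaluates the right-hand side for a three-pool chain using NumPy. All numbers, and the function name `carbon_rates`, are hypothetical and chosen only so that the arithmetic is easy to follow. The last lines use the fact that setting dX/dt = 0 in equation 5.1 gives a linear system for the steady-state pool sizes.

```python
import numpy as np


def carbon_rates(B, I, A, K, X):
    """Right-hand side of dX/dt = B*I + A K X (equation 5.1)."""
    return B * I + A @ K @ X


# Hypothetical three-pool chain: all input enters pool 1.
B = np.array([1.0, 0.0, 0.0])          # allocation fractions
I = 2.0                                 # external carbon input
K = np.diag([0.5, 0.1, 0.01])           # turnover rates on the diagonal
A = np.array([[-1.0, 0.0, 0.0],         # diagonal entries are -1
              [0.4, -1.0, 0.0],         # 40% of pool 1 outflow enters pool 2
              [0.0, 0.2, -1.0]])        # 20% of pool 2 outflow enters pool 3
X = np.array([4.0, 8.0, 16.0])          # pool sizes

dXdt = carbon_rates(B, I, A, K, X)

# Steady state: 0 = B*I + A K X  =>  X = solve(-(A K), B*I)
X_steady = np.linalg.solve(-(A @ K), B * I)
```

With these values the chosen X happens to be the steady state, so `dXdt` is zero in every pool.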
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The flow diagram and carbon balance equations 
of the TECO model were presented in Chapter 3
(Figure 3.5b, equations 3.6-3.12). In the case of
TECO, which has seven carbon pools, B and X are
7 × 1 vectors, while A and K are 7 × 7 matrices. K is a
diagonal matrix with each diagonal element k corresponding
to the turnover rate of a carbon pool.
All other elements in K are set to zero. The diagonal
elements of A are all -1. Non-zero elements
of the off-diagonal parts of A correspond to the
fraction of carbon fluxes transferred from the i-th
to the j-th pool, where i is the column number and j
the row number, starting at (1,1) from the top-left
corner of the matrix. For example, f41 in the first
column and fourth row tells us there are carbon
fluxes transferred from the first to the fourth carbon
pool, that is, from foliage biomass to structural
litter. All transfer fluxes in the carbon flow diagram
can be found in the A matrix. In the case of TECO,
equation 5.1 can be expanded as:

\[
\begin{pmatrix} dx_1(t)/dt \\ dx_2(t)/dt \\ dx_3(t)/dt \\ dx_4(t)/dt \\ dx_5(t)/dt \\ dx_6(t)/dt \\ dx_7(t)/dt \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} I
+
\begin{pmatrix}
-1 & 0 & \cdots \\
0 & -1 & \cdots \\
f_{31} & 0 & \cdots \\
f_{41} & f_{42} & \cdots \\
0 & 0 & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
\begin{pmatrix} k_1 & & \\ & \ddots & \\ & & k_7 \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_7 \end{pmatrix}
\]

This matrix version of the carbon model is
especially handy when we have many carbon
pools. We will take the litter and soil organic
carbon components of CLM4.5 (70 pools) and
ORCHIDEE-MICT (100 pools) as examples. The
matrix equations for CLM4.5 and ORCHIDEE-MICT
look similar to the matrix equation for TECO. In
Figure 5.1 we have one more item: the V matrix,
which captures the vertical transfers of carbon
between adjacent soil layers. The A matrix is more
complex than in the earlier case. Both CLM4.5 and
ORCHIDEE-MICT assume the fraction of carbon
transferred between different organic carbon pools
is the same for different soil depths. Here, A has the
form of a block matrix, which means A is made up
of smaller matrices, i.e., blocks. For CLM4.5, each
block, for example, A31, is a 10 × 10 matrix, corresponding
to 10 soil layers. The diagonal elements
in A31 are the same since the model assumes the
fractions of transfers do not change with depth. For
ORCHIDEE-MICT, it is a little more complex as litter
pools are not separated by soil depth. So transfers
of litter carbon fluxes to soil carbon fluxes
(for example, A51) are 32 × 1 vectors and zeros.
For transfers among soil organic carbon pools,
for example A65, the matrix has a dimension of
32 × 32, with each element representing the transfer
from one of the 32 active soil organic carbon
pools to one of the 32 slow soil organic carbon
pools. ξ is the matrix of environmental scalars that
quantify the deviation of the actual decomposition
rate from the potential rate due to non-optimal
environmental conditions. We could add
it to the TECO model also. See Huang
et al. (2018b) and Huang et al. (2018a) for more
details if you are interested.
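The block structure described for CLM4.5 can be sketched with a Kronecker product: when transfer fractions do not change with depth, each pool-level entry of A expands into a diagonal block with the same value in every layer. The code below is an illustration under that assumption. The three-pool transfer pattern and the value of `f31` are hypothetical and do not reproduce the CLM4.5 pool set; only the number of layers (10) comes from the text.

```python
import numpy as np

n_layers = 10   # CLM4.5 soil layers, as stated in the text
f31 = 0.25      # hypothetical transfer fraction from pool 1 to pool 3

# Depth-independent block: the same fraction on every diagonal entry.
A31 = f31 * np.eye(n_layers)

# Hypothetical pool-level transfer pattern for three pools.
A_pools = np.array([[-1.0, 0.0, 0.0],
                    [0.0, -1.0, 0.0],
                    [f31, 0.0, -1.0]])

# The Kronecker product replaces each pool-level entry a_ij by the
# 10 x 10 block a_ij * identity, giving a 30 x 30 block matrix.
A_full = np.kron(A_pools, np.eye(n_layers))

# Rows of pool 3 (layers 1-10) and columns of pool 1 (layers 1-10).
block_31 = A_full[2 * n_layers:3 * n_layers, 0:n_layers]
```

This ordering stacks all layers of pool 1 first, then all layers of pool 2, and so on, which is one possible convention for arranging the state vector.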
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Figure 5.1. Matrix equation of CLM4.5 and ORCHIDEE-MICT. 
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Figure 5.2. Matrix equation of the litter and soil organic carbon component of the ORCHIDEE model. The flow diagram and 


carbon balance equations are shown in Figure 3.6, Chapter 3. 


Based on these examples, we see that the matrix 
equations for different carbon models may take 
a similar form. But the structure of each matrix 
might be different. How each matrix is organized 
depends on the structure of the original model. 


HOW TO DERIVE THE MATRIX EQUATION? 


After we know what the matrix looks like, the next 
step is to derive the matrix model from the carbon 
balance equations. 

Here we take the ORCHIDEE model in Figure
3.6 of Chapter 3 as an example. We assume we have
I1, I2, I3 and I4 as inputs to the respective litter pools.
Carbon from the litter pools is transferred into different
soil organic carbon pools. Additionally, there
are internal carbon transfers between different soil
organic carbon pools. Figure 5.2 shows the matrix
equation of this model. X and I are column vectors,
and A and K are 2-dimensional matrices. The
dimensions of X, I, A and K are determined by the
number of carbon pools. X is easy to construct. We
put the seven carbon pools x1, x2, ..., x7 in a column.
I is a column vector with I1 to I4 as the input rates
for litter carbon pools and zero for soil organic
carbon pools as these pools receive no external
carbon inputs. K is a diagonal matrix with each
diagonal element corresponding to the turnover
rate of one of the carbon pools.

How to construct the transfer matrix A?

\[ \frac{dx_5}{dt} = f_{51} k_1 x_1 + f_{52} k_2 x_2 + f_{53} k_3 x_3 + f_{54} k_4 x_4 + f_{56} k_6 x_6 + f_{57} k_7 x_7 - k_5 x_5 \]

\[ \frac{dx_6}{dt} = f_{63} k_3 x_3 + f_{64} k_4 x_4 + f_{65} k_5 x_5 - k_6 x_6 \]

\[ \frac{dx_7}{dt} = f_{75} k_5 x_5 + f_{76} k_6 x_6 - k_7 x_7 \]

Figure 5.3. Illustration of the function of the transfer matrix A.

Mathematically, an n × n matrix (the A matrix) multiplied
by an n × 1 vector (X) will produce an n × 1
vector. The value of the i-th row of this product
equals the sum of the products of the elements
of the i-th row of the matrix with the corresponding
elements of the column vector. The multiplication
of K and X will give us a column vector
with elements k1x1, k2x2, ..., k7x7. For the multiplication
of A by KX, we take the sixth row as an example
(Figure 5.3). We have 0 multiplied by k1x1, 0 multiplied by
k2x2, f63 multiplied by k3x3, f64 multiplied by k4x4, f65 multiplied by
k5x5, -1 multiplied by k6x6, and 0 multiplied by k7x7, which is exactly the right
side of the carbon balance equation for the sixth
carbon pool. The diagonal elements of A are all -1.
There are many 0 elements in A. These 0s indicate
there are no carbon transfers from the i-th (column
number) to the j-th (row number) carbon pools.
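The row-by-row reading of A K X can be checked numerically. The sketch below builds a 7 × 7 transfer matrix whose non-zero pattern follows the balance equations for pools 5 to 7 given for Figure 5.3, then compares the sixth row of the matrix product with the sixth balance equation written out by hand. All numerical values are hypothetical.

```python
import numpy as np

# Hypothetical transfer fractions f[(row, col)]; only the positions of the
# non-zero entries follow the balance equations for pools 5-7.
f = {(5, 1): 0.4, (5, 2): 0.4, (5, 3): 0.3, (5, 4): 0.3,
     (5, 6): 0.4, (5, 7): 0.5,
     (6, 3): 0.3, (6, 4): 0.3, (6, 5): 0.2,
     (7, 5): 0.05, (7, 6): 0.05}

A = -np.eye(7)                          # diagonal elements are all -1
for (row, col), value in f.items():
    A[row - 1, col - 1] = value         # convert to 0-based indices

k = np.array([2.0, 2.0, 0.5, 0.5, 1.0, 0.1, 0.01])          # turnover rates
x = np.array([10.0, 10.0, 40.0, 40.0, 20.0, 300.0, 500.0])   # pool sizes
K = np.diag(k)

# Sixth row of the matrix product A K X (index 5 in 0-based numbering).
row6_matrix = (A @ K @ x)[5]

# Sixth balance equation written out term by term.
row6_equation = (f[(6, 3)] * k[2] * x[2] + f[(6, 4)] * k[3] * x[3]
                 + f[(6, 5)] * k[4] * x[4] - k[5] * x[5])
```

The two numbers agree, which is the sense in which the matrix form and the individual balance equations are the same model.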

If we look at each column of the A matrix, it
summarizes all the fluxes that go out of the i-th carbon
pool (Figure 5.4). For example, after reading
the fourth column of the A matrix, we know there
is carbon transferred from the fourth to the fifth and
sixth carbon pools if f54 and f64 are not equal to 0.
This is exactly what we see in the carbon flow diagram
(Figure 3.6 of Chapter 3). If we sum up all
the off-diagonal elements in one column and the
value is smaller than 1, for example, f54 + f64 < 1,
this means there is carbon leaving the system which
we do not track further. For example, if our model
is focused on the storage of organic carbon in soil,
we may not care about carbon that leaves the soil
system in the form of CO₂ flux to the atmosphere.

[Figure 5.4: flow diagram with pools x1 (aboveground metabolic litter), x2 (belowground metabolic litter), x3 (aboveground structural litter), x4 (belowground structural litter) and the CO₂ flux.]
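The column reading of A suggests a simple diagnostic: with -1 on the diagonal, the negative of each column sum is the fraction of a pool's outflow that is not passed on to any tracked pool. The sketch below assumes that convention. The function name `untracked_fraction` and the three-pool matrix are hypothetical.

```python
import numpy as np


def untracked_fraction(A):
    """Fraction of each pool's outflow that leaves the tracked system.

    Assumes diagonal entries of -1, so each column sum equals
    (sum of off-diagonal transfer fractions) - 1.
    """
    return -A.sum(axis=0)


# Hypothetical three-pool example: pool 1 feeds pools 2 and 3,
# pool 2 feeds pool 3, and pool 3 passes nothing on.
A = np.array([[-1.0, 0.0, 0.0],
              [0.5, -1.0, 0.0],
              [0.25, 0.5, -1.0]])

loss = untracked_fraction(A)
```

In a soil carbon model this untracked fraction would typically correspond to the respired CO₂ mentioned above.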

Each row of the matrix A summarizes all the carbon
fluxes that are transferred into the i-th carbon pool.
For example, if we look at the sixth row, we have f63,
f64 and f65 as non-zero, off-diagonal elements. This tells
us there are fluxes from the third, fourth and fifth
carbon pools being transferred into this sixth carbon
pool. The relevant transfers are highlighted in yellow
in the carbon flow diagram of Figure 5.5.

We have seen that a single matrix equation can 
encompass the details of multiple processes simu- 
lated by a carbon model. Detailed information 
about the carbon dynamics is folded into differ- 
ent components of the matrices. If we expand the 
matrix calculation by row, we should get the exact 
set of carbon balance equations, one equation for 
each pool.

Figure 5.5. Illustration of the function of the row of the transfer matrix (A matrix): each row summarizes the fluxes transferred into the i-th carbon pool.

Figure 5.6. Illustration of the vertical transfer matrix V of the ORCHIDEE-MICT model.


Next we are going to explore a little more on
the matrix equation for models with vertical soil
layers. For such a model, we require one more matrix,
the vertical transfer matrix V (Figure 5.6). The X, I and K
matrices are generally constructed in the same way as
in the earlier examples. The transfer matrix A,
with dimension 100 × 100, is folded into a block
matrix. Each of the smaller blocks in black, for
example A51, is a depth-dependent vector, tracking
the fraction of litter carbon transferred into
SOC at different soil depths. The matrices in red
are 32 × 32 diagonal matrices capturing transfers
between different soil organic carbon categories in
the same soil layer. Each item of the diagonal takes
the same value, as ORCHIDEE-MICT assumes the
transfer fractions are not depth-dependent.

The vertical transfer matrix can be represented
by the block matrix shown in Figure 5.6. Most of
its components are 0 except for the active, slow,
and passive SOC pools. Each diagonal block is a
tridiagonal matrix that describes vertical redistribution
of the corresponding carbon pool among different
soil layers. As the vertical transfer rates are
not differentiated among different types of carbon
pools, V55(t), V66(t), and V77(t) are identical. The
subscript numbers indicate soil layers; h and g correspond
respectively to the mixing rate between
the current soil layer and the one above it, and the
current soil layer and the one below it; z indicates
the depth of soil layer i. Detailed information is
available in Huang et al. (2018a).
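To make the idea of a tridiagonal vertical-transfer block concrete, the sketch below builds one such block for a single pool type. It is a simplified illustration and not the ORCHIDEE-MICT formulation: the function name `vertical_block`, the upward and downward rates, and the sign convention (the block is added to the right-hand side as V X) are all assumptions, and layer depths are not used.

```python
import numpy as np


def vertical_block(rate_up, rate_down):
    """Tridiagonal mixing block for one pool type across n soil layers.

    rate_up[i]   : rate at which layer i sends carbon to the layer above
                   (must be 0 for the top layer)
    rate_down[i] : rate at which layer i sends carbon to the layer below
                   (must be 0 for the bottom layer)
    Assumed convention: the block enters the model as dX/dt = ... + V X.
    """
    n = len(rate_up)
    V = np.zeros((n, n))
    for i in range(n):
        V[i, i] = -(rate_up[i] + rate_down[i])   # loss from layer i
        if i > 0:
            V[i - 1, i] = rate_up[i]             # gain of the layer above
        if i < n - 1:
            V[i + 1, i] = rate_down[i]           # gain of the layer below
    return V


# Hypothetical rates for four layers, top layer first.
up = np.array([0.0, 0.01, 0.01, 0.01])
down = np.array([0.02, 0.02, 0.02, 0.0])
V55 = vertical_block(up, down)

# Because the rates do not depend on pool type, the same block can be
# reused for the other SOC pools.
V66 = V55.copy()
```

Every column of the block sums to zero, so vertical mixing only redistributes carbon among layers and does not create or remove it.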

In this chapter we have learned how to derive 
the matrix equation from the carbon balance equa- 
tions. It would also be possible to derive the matrix 
equation directly from the carbon flow diagram. 
From the carbon flow diagram, we can determine 
the dimension of matrices (or vectors) from the 
number of carbon pools. The diagram also tells 
us the external carbon inputs, the incoming and 
outgoing fluxes of each pool, and the direction of 
these fluxes. These are basically what we need to 
construct the matrix equation.

QUIZZES

1. Is the sum of each row of the transfer matrix A always not bigger than 0?

2. Is the sum of each column of the transfer matrix A always not bigger than 0?

3. How would you add a new carbon pool into the matrix equation?

4. Why is a matrix equation exactly the same as a carbon balance equation in terms of describing the carbon cycle?
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Carbon sequestration in terrestrial ecosystems is 
strongly regulated by nitrogen processes. Many 
global land models now simulate the carbon and 
nitrogen interaction. The goal of this chapter is 
to understand how coupled carbon and nitrogen 
models at ecosystem and global scales may be rep- 
resented in the matrix form. Basically, nitrogen 
transfers among all the organic nitrogen pools can 
be represented in one matrix equation that is equivalent
to the carbon matrix equation. The carbon and
nitrogen matrix equations are coupled through the 
C:N ratio. Mineral nitrogen dynamics, determined 
by nitrogen input, mineralization, plant uptake and 
leaching, can also be described by one equation. 


INTRODUCTION 


Rising atmospheric carbon dioxide (CO₂) concentration
tends to induce carbon (C) sequestration
in terrestrial ecosystems. The conceptual framework
of progressive nitrogen (N) limitation has
predicted N limitation on future C sequestration
in terrestrial ecosystems in response to rising
atmospheric CO₂. The N limitation may become
progressively stronger over time unless N fixation
is stimulated and/or N losses are reduced, leading 
to increased N capital (Luo et al., 2004). In addi- 
tion, the degree of N regulation on terrestrial C 
sequestration depends on changes in several C-N 
coupling parameters, such as the stoichiometric 
flexibility of C:N ratios of biomass compartments, 
changes in plant N uptake via soil exploration, and 
N redistribution from soil to vegetation. Encoding 
what we know about how C and N flow within 
ecosystems, and how these flows are coupled, can 
help to fully understand the strength of N regula- 
tion on C sequestration in terrestrial ecosystems. 

In this chapter, we will show how C and N 
cycling are coupled in an ecosystem model and a 
global land model, and how the coupled processes 
can be represented in the matrix form. 


MATRIX REPRESENTATION OF CN COUPLING IN
TERRESTRIAL ECOSYSTEM (TECO) MODEL


One of the coupled C and N models presented in
this chapter is developed from the terrestrial ecosystem
(TECO) model (Weng & Luo, 2011). TECO
was first presented in Chapter 1, and its matrix representation
introduced in Chapter 5. The coupled
C-N version we will discuss here, called TECO-CN,
has eight C and N pools in addition to one mineral
N pool. C enters the ecosystem through canopy
photosynthesis and is then allocated into foliage
(X1), wood biomass (X2) and fine roots (X3)
(Figure 6.1). Similarly, N is absorbed by plants
from mineral soil, and then partitioned among
leaf (N1), woody tissues (N2) and fine roots (N3).
Detritus from plant biomass turnover is transferred
to metabolic litter (X4) and structural litter
(X5) pools, where it is decomposed by microbes
(X6). The structural litter (X5) is partly respired
while partly converted into fast (X6) and slow soil
organic matter (SOM, X7). During these C cycling
processes, N in plant detritus is also transferred
among the same set of ecosystem pools (i.e., litter,
fast, slow and passive SOM). Mathematically, these
C processes may be described by the following
first-order ordinary differential equation:

\[ \frac{dX(t)}{dt} = B I + A \xi(t) K X(t) \tag{6.1} \]

where X = (X1, X2, X3, X4, X5, X6, X7, X8) represents C pools
in leaf, wood, fine roots, metabolic litter, structural
litter, microbe, slow and passive SOM, respectively.
ξ(t) is a vector of environmental scalars accounting
for temperature and moisture effects on all
C decomposition. A describes C transformation
among various ecosystem compartments, given as:

\[
A = \begin{pmatrix}
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
f_{41} & 0 & f_{43} & -1 & 0 & 0 & 0 & 0 \\
f_{51} & f_{52} & f_{53} & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & f_{64} & f_{65} & -1 & f_{67} & f_{68} \\
0 & 0 & 0 & 0 & f_{75} & f_{76} & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & f_{86} & f_{87} & -1
\end{pmatrix}
\]

Figure 6.1. Carbon and nitrogen pools and pathways of carbon and nitrogen fluxes in the TECO-CN model. Blue arrows show carbon cycling processes, while pink arrows indicate nitrogen cycling processes. SOM = soil organic matter.
COUPLED CARBON-NITROGEN MATRIX MODELS 


The non-zero elements (f_ij) in matrix A describe
C transfer coefficients (i.e., the fractions of the C
entering the i-th pool from the j-th pool), while the
zero elements indicate no C flows between these
two pools. K is an 8 × 8 diagonal matrix with diagonal
entries given by the vector (k1, k2, k3, k4, k5, k6, k7,
k8), the baseline turnover rates of the carbon pools (i.e., the
amounts of C per unit mass leaving each pool per
day). I is the input and B = (b1, b2, b3, 0, 0, 0, 0, 0) is
the vector of allocations of input into ecosystem carbon pools.

The N processes can be described by:

\[ \frac{dN(t)}{dt} = A \xi(t) K R^{-1} X(t) + \kappa_u N_{min}(t) \Pi \tag{6.2} \]

where N = (n1, n2, n3, n4, n5, n6, n7, n8) represents
N pools in leaf, wood, fine roots, metabolic litter,
structural litter, microbe, slow and passive SOM,
respectively. R is an 8 × 8 diagonal matrix with
diagonal entries given by the vector (r1, r2, r3, r4,
r5, r6, r7, r8), representing C:N ratios in the eight
organic N pools. Π = (π1, π2, 1 - π1 - π2, 0, 0, 0, 0, 0)
is a vector of allocation coefficients expressing the
proportion of plant N uptake that is allocated to
leaf, wood and fine root biomass pools. κ_u is the rate
of plant N uptake per time step (year), expressed as
a proportion of N_min(t), the amount of plant available
N (mineral N) in soil at time t. The dynamics
of the mineral N pool are determined by the balance
between N input (i.e., N mineralization, biological
fixation and atmospheric deposition) and output
through plant N uptake and N loss (i.e., leaching
and gaseous N fluxes), which can be expressed by:

\[ \frac{dN_{min}(t)}{dt} = -(\kappa_u + \kappa_l) N_{min}(t) + \xi(t) \varphi A K R^{-1} X(t) + F(t) \tag{6.3} \]

where κ_u and κ_l are rates of N uptake and loss,
respectively. The second term on the right side of
Equation 6.3 describes the amount of N released
during mineralization. φ is the proportion for
mineral N production. The C and N cycles are coupled
through the parameter R, which is the C:N ratio.
F(t) represents N input through biological fixation
and atmospheric deposition.
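The coupling through the C:N ratio matrix R can be sketched in a few lines. The example below evaluates the carbon equation and the organic N equation for a reduced three-pool system (plant, litter, soil) rather than the eight pools of TECO-CN, treats the environmental scalar as a single number, and leaves out the mineral N equation. The function name `coupled_cn_rates` and all values are hypothetical.

```python
import numpy as np


def coupled_cn_rates(X, N_min, A, K, R, B, Pi, I, xi, kappa_u):
    """C and organic-N rates for a matrix model coupled through C:N ratios.

    dX/dt = B*I + A (xi K) X
    dN/dt = A (xi K) R^-1 X + kappa_u * N_min * Pi
    xi is treated as a scalar here, which is a simplification.
    """
    AK = A @ (xi * K)
    dX_dt = B * I + AK @ X
    dN_dt = AK @ np.linalg.solve(R, X) + kappa_u * N_min * Pi
    return dX_dt, dN_dt


# Hypothetical three-pool system: plant -> litter -> soil.
A = np.array([[-1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0],
              [0.0, 0.5, -1.0]])
K = np.diag([0.1, 0.5, 0.01])        # turnover rates
R = np.diag([50.0, 40.0, 10.0])      # C:N ratio of each pool
B = np.array([1.0, 0.0, 0.0])        # all C input goes to the plant pool
Pi = np.array([1.0, 0.0, 0.0])       # all N uptake goes to the plant pool
X = np.array([100.0, 20.0, 500.0])   # C pool sizes

dX_dt, dN_dt = coupled_cn_rates(X, N_min=1.0, A=A, K=K, R=R, B=B, Pi=Pi,
                                I=10.0, xi=1.0, kappa_u=0.2)
```

Here `np.linalg.solve(R, X)` computes R⁻¹X, the N content implied by each C pool and its C:N ratio, so the N fluxes follow the same transfer network as the C fluxes.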


APPLICATION OF MATRIX REPRESENTATION 
OF CN COUPLED MODEL 


To illustrate the application of matrix forms of a
C-N coupled model, we designed a case study to
examine changes in C-N coupling parameters under
CO₂ enrichment using a data assimilation approach.
The study was based on measurements of C and N pools in various
ecosystem compartments (i.e., foliage, woody tissues,
fine roots, microbes, forest floor, soil inorganic
N, and mineral soil) and fluxes (i.e., litterfall, soil
respiration and mineralization, plant N uptake, and
N input from biological fixation and atmospheric
deposition) obtained from the Duke Forest Free-Air
CO₂ Enrichment (FACE) experiment in North
Carolina, USA, during the period 1996-2005. Key
parameters of TECO-CN (i.e., C:N ratio, N uptake,
N allocation coefficient, N input, N loss and the initial
value of mineral soil N pool) were estimated
through a Bayesian probabilistic inversion. The
inversion was done separately for plots with ambient
versus elevated CO₂ treatments, yielding one set
of parameters for each (Shi et al., 2016).

Comparison of parameter distributions showed
that plant N uptake and C:N ratios in foliage, fine roots,
metabolic and structural litter were significantly
higher under elevated than ambient CO₂, whereas
CO₂ enrichment did not exert significant effects on
C:N ratios in woody tissues and SOM. Moreover, elevated
CO₂ led to a decrease in C exit rates in foliage,
woody biomass, structural litter and passive SOM,
indicating an increase in C residence time in these
ecosystem compartments. By contrast, elevated CO₂
resulted in an increase in the C exit rate in fine roots,
demonstrating faster fine root turnover under CO₂
enrichment. In addition, C allocation to the foliage
became smaller under elevated CO₂, while C allocation
to fine roots tended to be larger under CO₂
enrichment. The estimated parameters were then
used for a forward analysis to examine ecosystem
C and N dynamics under ambient and elevated CO₂
conditions at Duke Forest. Our results demonstrated
that modeled N pools in foliage, woody tissues,
fine roots, and forest floor closely matched
the corresponding measurements for both ambient
and elevated CO₂ scenarios (Figure 6.2). However,
TECO-CN could not capture the observed declining
trend of microbial N content with time. In addition,
the trained model did not simulate N dynamics in
mineral soil well, partly due to the large variations
in SOM measurements among different years.


MATRIX REPRESENTATION OF C-N COUPLING
IN CLM5


The other C-N coupled model we will examine in
this chapter is the Community Land Model version
5 (CLM5, Figure 6.3). To extend from the ecosystem
scale to the global scale, CLM5 simulations represent
various ecosystems on a grid covering the global
land area. Like TECO-CN, the CLM5 biogeochemistry
module includes carbon and nitrogen cycles
for aboveground and belowground processes
(Lawrence et al., 2019). The vegetation modules
simulate the biogeochemical transfers among 18
carbon pools and 19 nitrogen pools, representing
vegetation and soil compartments of a grid cell or
tile. Tissue pools, storage pools and transfer pools
are included, respectively, in leaf, fine root, live
stem, dead stem, live coarse root, and dead coarse
root compartments. Compared with the C pools, the N
pools include one more retranslocation pool temporarily
storing N from litter fall. The soil biogeochemistry
module has 20 soil layers as a default
setting. Each layer contains seven pools for organic
C and organic N, respectively, in metabolic litter,
cellulose litter, lignin litter, coarse woody debris,
fast soil organic matter, slow soil organic matter,
and passive soil organic matter. Inorganic N pools,
such as ammonium and nitrate, interact with
each other, with organic C and organic N pools,
or with the environment through multiple nitrogen
processes such as nitrification, denitrification,
leaching, atmospheric N deposition, and biological
N fixation. All these biogeochemical processes
and pools in both vegetation and soil modules
can be formulated as carbon or nitrogen balance
equations.

Figure 6.2. The comparisons of modeled v. measured nitrogen pools in various ecosystem compartments under ambient CO₂ (blue lines) and elevated CO₂ (red lines).

We may reorganize the vegetation carbon and
nitrogen balance equations of CLM5 into two
matrix equations:

\[ \frac{dC_{veg}}{dt} = B_C I_{Cin} + \left( A_{phc}(t) K_{phc} + A_{gmc}(t) K_{gmc} + A_{fic}(t) K_{fic} \right) C_{veg}(t) \tag{6.4} \]

\[ \frac{dN_{veg}}{dt} = B_N I_{Nin} + \left( A_{phn}(t) K_{phn} + A_{gmn}(t) K_{gmn} + A_{fin}(t) K_{fin} \right) N_{veg}(t) \tag{6.5} \]

C_veg and N_veg are time-dependent state variables,
which are vectors, each entry representing
its respective vegetation pool size (gC m⁻²
and gN m⁻²). I_Cin and I_Nin are scalars for C and N
input, respectively. C input is the net primary productivity,
which is the difference between gross
primary productivity (i.e., plant photosynthesis)
and autotrophic respiration. N input to the vegetation
N cycle includes both biological N fixation
and N uptake from roots. B_C and B_N are also vectors,
representing allocation fractions of plant C
and N input to individual pools. The K matrices are
n × n diagonal matrices whose diagonal elements
represent turnover rates (pool fraction per annual
time step) due to different plant and vegetation
processes: subscripts ph, gm, and fi indicate phenology,
gap mortality (i.e., harvest from land use and
natural mortality), and fire processes, respectively,
as described by:

\[ K_{p} = \begin{pmatrix} k_{p,1} & & 0 \\ & \ddots & \\ 0 & & k_{p,n} \end{pmatrix}, \qquad p \in \{ph, gm, fi\} \]

The turnover rates in the plant phenology matrix
K_ph indicate the leaf, root, live stem, and dead stem
turnover due to phenology processes. The exit
rates in the gap mortality matrix K_gm include harvest
rates from land use plus natural mortality. The exit
rates in the fire matrix K_fi represent the plant C loss rate
due to fires.

A is a transfer coefficient matrix, representing C
and N transfer among pools as specified in equations
6.6-6.11 below for CLM5. Subscripts c and n
respectively denote carbon and nitrogen.
The off-diagonal entry, a_{i,j}, of matrix A represents
the fraction of C or N leaving pool j that goes
to pool i. The diagonal entries are set to -1 to represent
that all the exiting C leaves pool j. The pool
names referred to by the subscripts i or j can be found
in Figure 6.3.

Figure 6.3. Carbon and nitrogen flow diagram of vegetation biogeochemical cycle in CLM5. GPP = gross primary production; NPP = net primary production.
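The structure of the vegetation equation, one transfer matrix and one turnover matrix per process, can be sketched as a sum of matrix products. The example below uses two hypothetical pools (a leaf storage pool and a leaf pool) instead of the 18 vegetation carbon pools of CLM5. The function name `vegetation_rates` and all rates are assumptions for illustration.

```python
import numpy as np


def vegetation_rates(C, B, I, processes):
    """dC/dt = B*I + (sum over processes of A_p K_p) C.

    processes is a list of (A_p, K_p) pairs, e.g. phenology,
    gap mortality and fire.
    """
    total = sum(A_p @ K_p for A_p, K_p in processes)
    return B * I + total @ C


# Hypothetical two-pool system: pool 1 = leaf storage, pool 2 = leaf.
A_ph = np.array([[-1.0, 0.0],
                 [1.0, -1.0]])          # phenology moves storage C into leaf
K_ph = np.diag([0.5, 1.0])
A_gm = -np.eye(2)                        # gap mortality removes C from both
K_gm = np.diag([0.02, 0.02])
A_fi = -np.eye(2)                        # fire removes C from both
K_fi = np.diag([0.01, 0.01])

B = np.array([1.0, 0.0])                 # all input is allocated to storage
I = 3.0                                  # net primary productivity input
C = np.array([4.0, 2.0])                 # pool sizes

dC_dt = vegetation_rates(C, B, I, [(A_ph, K_ph), (A_gm, K_gm), (A_fi, K_fi)])
```

Keeping each process as its own pair of matrices makes it possible to switch a process off, or to inspect its contribution, without touching the others.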

Interactions between vegetation C and N cycles
in the original CLM5 are fully preserved in this
matrix version of the CLM5 model. The original
CLM5 has two modules to regulate C and N
interactions for vegetation. First, the photosynthetic
capacity, an important variable driving the
carbon cycle, interacts with the nitrogen cycle in
the LUNA module. LUNA optimizes N allocation
to maximize the daily net photosynthetic carbon
gain. Second, plant N uptake interacts with
carbon assimilation in the FUN module. FUN is based
on the concept that N uptake requires the expenditure
of energy in exchange for carbon. Both
modules are fully preserved to drive the C and N
interactions in the matrix version of the model.

The implementation of soil C and N cycling 
extends the maximum soil layers from a fixed 
value of 10 in CLM4.5 (see Chapter 5) to a default 
value of 20 with flexibility to change in CLM5. 

N balance equations in the original CLM5 are 
likewise amenable to a matrix representation. The 
soil organic C and N transfer among soil pools is 
formalized by the following matrix equations: 


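$$
\frac{dC_{soil}(t)}{dt} = I_{Csoil} + \left( A_{Ch}\,\xi(t)\,K_h + V(t) + K_f(t) \right) C_{soil}(t) \qquad (6.6)
$$

$$
\frac{dN_{soil}(t)}{dt} = I_{Nsoil} + \left( A_{Nh}\,\xi(t)\,K_h + V(t) + K_f(t) \right) N_{soil}(t) \qquad (6.7)
$$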


As with vegetation, $C_{soil}$ and $N_{soil}$ are vectors of state variables, representing soil organic C and N pool sizes in gC m$^{-3}$ and gN m$^{-3}$, respectively. $I_{Csoil}$ and $I_{Nsoil}$ are vectors representing plant litterfall into different litter C and N pools, respectively. $A_{Ch}$ and $A_{Nh}$ represent the horizontal transfers of C and N, respectively, which means transfers among pools within one soil layer. $V$ stands for the rate of vertical mixing between the same types of pools across soil layers, which is the same for C and N. $K_f$ is the rate of fire-induced litter loss, which is the same for C and N.

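Matrix $A_h$, including both $A_{Ch}$ and $A_{Nh}$, is a block matrix constituted by several matrices as:

$$
A_{Ch} \text{ or } A_{Nh} =
\begin{pmatrix}
A_{11} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & A_{22} & 0 & 0 & 0 & 0 & 0 \\
A_{31} & 0 & A_{33} & 0 & 0 & 0 & 0 \\
A_{41} & 0 & 0 & A_{44} & 0 & 0 & 0 \\
0 & A_{52} & A_{53} & 0 & A_{55} & A_{56} & A_{57} \\
0 & 0 & 0 & A_{64} & A_{65} & A_{66} & 0 \\
0 & 0 & 0 & 0 & A_{75} & A_{76} & A_{77}
\end{pmatrix}
$$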


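Each of the matrices $A_{ij}$ within the block matrix $A_{Ch}$ or $A_{Nh}$ is $20 \times 20$, corresponding to 20 soil layers. The non-zero, off-diagonal matrices $A_{ij}$, $i \neq j$, indicate C and N transfer among pools within one soil layer as:

$$
A_{ij} = \begin{pmatrix} a_{ij,1} & & 0 \\ & \ddots & \\ 0 & & a_{ij,20} \end{pmatrix}, \qquad
a_{ij,k} = \begin{cases}
-1, & i = j; \text{ C and N cycles} \\
(1 - r_{ij})\,T_{ij}, & i \neq j; \text{ C cycle } (A_{Ch}) \\
(1 - r_{ij})\,\dfrac{CN_{j,k}}{CN_{i,k}}\,T_{ij}, & i \neq j; \text{ N cycle } (A_{Nh})
\end{cases}
$$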
Each of the diagonal matrices $A_{ii}$ is a negative identity matrix (i.e., a $20 \times 20$ matrix with diagonal elements equal to $-1$ and all other elements zero). The off-diagonal matrices $A_{ij}$ have non-zero diagonal values, $(1 - r_{ij})\,T_{ij}$ for carbon and $(1 - r_{ij})\,\frac{CN_{j,k}}{CN_{i,k}}\,T_{ij}$ for nitrogen. $A_{ij}$ represents transfer coefficients. $r_{ij}$ is the respired fraction of C along the transfer pathway from pool $j$ to $i$. $T_{ij}$ represents a pathway fraction of C going to pool $i$ from that leaving the $j$th pool due to decomposition. $CN_{j,k}$ represents the C:N ratio of pool $j$ in layer $k$.
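The two kinds of transfer coefficient can be illustrated with a short Python sketch. It assumes that the nitrogen coefficient is the carbon coefficient scaled by the ratio of the donor C:N ratio to the receiver C:N ratio, which is how we read the expression above; the function names and all numbers are hypothetical.

```python
def carbon_transfer(resp_fraction, path_fraction):
    """C transfer coefficient from pool j to pool i: (1 - r_ij) * T_ij."""
    return (1.0 - resp_fraction) * path_fraction


def nitrogen_transfer(resp_fraction, path_fraction, cn_donor, cn_receiver):
    """N transfer coefficient: the C coefficient scaled by CN_j / CN_i (assumed form)."""
    return carbon_transfer(resp_fraction, path_fraction) * cn_donor / cn_receiver


# Hypothetical pathway: half of the C is respired, all of the rest follows this path.
a_c = carbon_transfer(0.5, 1.0)
a_n = nitrogen_transfer(0.5, 1.0, cn_donor=20.0, cn_receiver=10.0)
```

With these illustrative numbers the receiving pool takes up half of the carbon but all of the nitrogen that leaves the donor, because the receiver needs twice as much N per unit of C.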

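The diagonal matrices $K_h$ and $K_f$ indicate the turnover rates due to decomposition (horizontal transfer) and fire, respectively, at different layers:

$$
K_h = \begin{pmatrix} k_{h,1} & & 0 \\ & \ddots & \\ 0 & & k_{h,n} \end{pmatrix}, \qquad
K_f = \begin{pmatrix} k_{f,1} & & 0 \\ & \ddots & \\ 0 & & k_{f,n} \end{pmatrix}
$$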


Environmental scalars $\xi$ are time-dependent variables and are the product of the temperature scalar $\xi_T$, water scalar $\xi_W$, oxygen scalar $\xi_O$, depth scalar $\xi_D$, and nitrogen scalar $\xi_N$:

$$
\xi(t) = \xi_T(t)\,\xi_W(t)\,\xi_O\,\xi_D\,\xi_N(t)
$$


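The vertical mixing coefficient matrix $V$ is made up of 6 identical matrices $v$:

$$
V(t) = \begin{pmatrix}
v & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & v & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & v & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & v & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & v & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & v
\end{pmatrix}
$$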


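Note that vertical mixing of coarse woody debris (CWD) is not allowed in CLM5; therefore, the corresponding vertical mixing matrix is 0 for CWD. The matrix $v$ is a tridiagonal matrix, indicating that vertical mixing only transfers between adjacent layers:

$$
v = \mathrm{diag}(dz_1, dz_2, \ldots, dz_{20})^{-1}
\begin{pmatrix}
g_1 & -g_1 & 0 & \cdots & & 0 \\
-h_2 & h_2 + g_2 & -g_2 & & & \\
0 & -h_3 & h_3 + g_3 & -g_3 & & \vdots \\
\vdots & & \ddots & \ddots & \ddots & \\
 & & & -h_{19} & h_{19} + g_{19} & -g_{19} \\
0 & \cdots & & 0 & -h_{20} & h_{20}
\end{pmatrix}
$$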

The subscripts represent the soil layer; g and 
h are vertical mixing rates related to upward and 
downward transfers. 
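The tridiagonal structure can be sketched in a few lines of Python. The fragment assumes the pattern described above (row $k$ holds $-h_k$, $h_k + g_k$, $-g_k$, with the first and last rows truncated) and leaves out any scaling by layer thickness; the function name and the four-layer rates are hypothetical.

```python
def mixing_matrix(g, h):
    """Build the unscaled tridiagonal mixing matrix for n layers.

    g[k] and h[k] are the two mixing rates of layer k (0-based). The rate
    h of the first layer and the rate g of the last layer are not used,
    because those layers have only one neighbour.
    """
    n = len(g)
    if len(h) != n:
        raise ValueError("g and h must have the same length")
    v = [[0.0] * n for _ in range(n)]
    for k in range(n):
        h_k = h[k] if k > 0 else 0.0       # coupling with layer k - 1
        g_k = g[k] if k < n - 1 else 0.0   # coupling with layer k + 1
        v[k][k] = h_k + g_k
        if k > 0:
            v[k][k - 1] = -h_k
        if k < n - 1:
            v[k][k + 1] = -g_k
    return v


# Hypothetical rates for a four-layer column.
g = [0.1, 0.2, 0.3, 0.4]
h = [0.0, 0.05, 0.06, 0.07]
v = mixing_matrix(g, h)
row_sums = [sum(row) for row in v]
```

Every row of this unscaled matrix sums to zero, and only entries between adjacent layers are non-zero, which is the property the text attributes to vertical mixing.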

As for the vegetation component of CLM5, 
interactions between C and N cycles in the soil 
from the original CLM5 model are fully preserved 
in the matrix version. In the original CLM5, nitro- 
gen limits soil organic C decomposition. The N 
limitation is represented by the ratio between avail- 
able mineral N and the total soil N demand. The 
soil N demand includes soil immobilization during 
soil decomposition and plant uptake demand. The 
dynamics of mineral N processes that are involved 
in C and N interactions can be represented by one equation as fully described in Shi et al. (2016). The equation on mineral N dynamics can be coupled with the matrix equations on organic C and N processes to analytically or semi-analytically explore their interactions. Recoding the model in matrix form simplifies the original representation of the complex C and N transformation network of a biogeochemistry model like CLM5. Traceable components of the C and N cycles can thereby be abstracted and used to build surrogate models with less computational cost than the original models. Additionally, the matrix form provides a more robust diagnostic capability.

GLOBAL VALIDATION OF THE CLM5 MATRIX MODEL FOR C AND N SIMULATIONS

The temporal dynamics of C and N storage simulated by the matrix modules were compared with the dynamics simulated by the original versions of these modules. The CLM5 matrix model and the original CLM5 use the same default initial size of vegetation and soil C and N storage without spin-up for the transition simulation. Modeled C and N storage in total ecosystem, vegetation, and soil organic matter from the matrix model matched those from the original CLM5 (Figure 6.4). It is worth mentioning that the matrix module is not an exact representation of the original model because different processes in the original model are updated stepwise, while all processes in the matrix model are updated simultaneously. However, our validation results show that the relative differences of soil C and N are less than 1% for most grid cells. Only 0.4% and 0.5% of the grid cells in vegetation C and N storage, respectively, diverged by more than 1% relative difference (Figure 6.4g-h). Only two of 1466 global grid cells in ecosystem C and N storage diverged by more than 1%. Global annual means in all six C and N storages over the 1901-2014 simulation period lined up on the 1:1 line, indicating close agreement (Figure 6.4a-f). The good agreement demonstrates that the matrix model can faithfully reproduce both temporal dynamics and spatial patterns of C and N states from the original model.

Figure 6.4. Comparison of simulated global carbon (C) and nitrogen (N) stocks from 1901 to 2014 between the CLM5 matrix model and the original CLM5. (a) Vegetation C storage, (b) soil C storage, (c) ecosystem C storage, (d) vegetation N storage, (e) soil N storage, and (f) ecosystem N storage are summed up over all land grid cells each year for comparison. Relative differences averaged over the last four years (2011-2014) are calculated as $\frac{X_{matrix} - X_{original}}{X_{original}} \times 100$. $X_{matrix}$ represents (g) vegetation C storage, (h) vegetation N storage, (i) soil C storage, (j) soil N storage, (k) ecosystem C storage, and (l) ecosystem N storage from the CLM5 matrix model. $X_{original}$ is the counterpart from the original CLM5.

As mentioned before, minor divergence between 
the two model versions could be explained by the 
difference between the simultaneous nature of the 
calculations encapsulated by the matrix equations, 
versus the stepwise nature of the original model’s 
algorithms. The matrix modules update C and N 
state variables only once within each time step, 
whereas the original modules update these state 
variables one after the other within each time step. 
This means that updates in a state variable calcu- 
lated early in the sequence can affect the calcula- 
tion for a subsequent state variable. For example, 
leaf turnover is updated by both phenology and 
fire in sequence in the original model. Thus, the 
turnover in leaf C from fire is proportional to the 
pool size after being updated by phenology in the 
original model, while the turnover in leaf C from 
fire is proportional to the pool size before phenol- 
ogy in the matrix model. As a consequence, simu- 
lated dynamics may slightly differ between the two 
model realizations. Nevertheless, the differences in 
modeled state variables due to different update 
methods of the two models were small enough not 
to generate notable differences as shown in global 
simulation of the C and N storage. 
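The leaf turnover example can be made concrete with a small Python sketch. The pool size and the two rates are hypothetical numbers, and the two functions are simplified stand-ins for the update order described above, not CLM5 code.

```python
def sequential_update(pool, k_ph, k_fi):
    """Original-style update: fire acts on the pool left after phenology."""
    pool = pool - k_ph * pool
    pool = pool - k_fi * pool
    return pool


def simultaneous_update(pool, k_ph, k_fi):
    """Matrix-style update: both losses use the pool size at the start of the step."""
    return pool - (k_ph + k_fi) * pool


leaf_c = 100.0  # hypothetical leaf C pool
seq = sequential_update(leaf_c, k_ph=0.10, k_fi=0.05)
sim = simultaneous_update(leaf_c, k_ph=0.10, k_fi=0.05)
difference = seq - sim  # equals k_ph * k_fi * leaf_c
```

For one step the two schemes differ by the product of the two rates times the pool size, which stays small as long as the per-step turnover rates are small.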

To summarize, we have shown in this chapter 
how matrix equations may be used to represent 
C-N coupled models for application at ecosys- 
tem and global scales. The new representation as 
a matrix equation for C-N coupled models has 
several advantages. The matrix form could help build up a surrogate model for parameter inversion and parameter sensitivity analysis, especially for understanding the role of different C-N interactions for N limitation of CO$_2$ fertilization. Previously, parameter inversions or parameter sensitivity analysis of terrestrial biogeochemical models like CLM5 have been prohibitively expensive computationally. The surrogate model in matrix form greatly reduces the computational cost (Huang et al., 2018b; Tao et al., 2020). Another advantage of the matrix form is that it enables a strong diagnostic approach, the traceability analysis, which attributes the differences in simulated C and N storage to several traceable components (see Chapter 17). These traceable components convey
critical ecological or biogeochemical meanings 
which can help us understand the key drivers of 
the spatial and temporal variability in terrestrial C 
and N cycles emerging from model simulations. 
For example, water scalars have been identified 
as the most significant traceable component to 
explain wide divergence between estimates of per- 
mafrost carbon storages driven by two reanalysis 
meteorological datasets, GSWP3 and CRUNCEP 
(Lu et al., 2020). The success in capturing the 
dynamics of biogeochemical cycling of a com- 
plex model such as CLM5 indicates the feasibility 
of implementing matrix form in other terrestrial 
biogeochemistry models. 

QUIZZES

1. What is the matrix form of ecosystem nitrogen cycling?
2. What are the differences between carbon and nitrogen cycles in the matrix form?
3. What are the key parameters to couple carbon and nitrogen cycles?
4. What drives the dynamics of the mineral nitrogen pool?

Models of the terrestrial carbon cycle are particular cases of compartmental dynamical systems, which are systems of differential equations that must conserve mass. This chapter introduces the main mathematical properties of compartmental dynamical systems and proposes a classification scheme that is useful for the analysis of carbon cycle models. This classification scheme distinguishes between models where carbon inputs and rates change over time or remain constant (nonautonomous versus autonomous models), and between models in which the mass in one compartment does or does not interact with the mass in other compartments (nonlinear versus linear models). We show that simple concepts such as steady state may not be well defined for some groups of models, and present alternative concepts such as the pullback attractor for the analysis of models with no steady state. In addition, this chapter introduces the theoretical basis for the mathematical analysis of models written in matrix form.


INTRODUCTION 


The matrix representation of models has emerged 
as a very general representation of ecosystem mod- 
els, particularly models that track the movement 
of carbon, nitrogen, and other elements inside 
vegetation and soil pools (Mulholland and Keener 
1974; Matis et al. 1979; Bolin 1981; Luo and Weng
2011; Xia et al. 2013; Luo et al. 2017). For soil 
organic matter models, some of the first represen- 
tations in matrix form were the models of Bolker 
et al. (1998), Baisden and Amundson (2003), and 
Tuomi et al. (2009). For these authors, the matrix 
representation helped them organize the set of 
differential equations that resulted in their model 
in a more manageable and compact form. This is
also the case in other fields of science such as biol- 
ogy or chemistry, where large sets of differential 
equations can be organized using this compact 
representation. 

In fact, any model that represents the mass bal- 
ance of a quantity such as atoms or molecules, can 
be represented in this form. Compared to other 
systems of differential equations, mass-balanced 
systems are special in the sense that all quantities 
are generally non-negative; i.e., the information 
that is fed into the model, and the predictions it 
produces can only exist inside the domain of the
non-negative real numbers. Furthermore, the mass balance
constraint leads to a special type of dynamical
system known as a compartmental system. 

We will now introduce the mathematical con- 
cept of compartmental systems, and will show that 
models written in compartmental form have a specific set of mathematical properties. These properties, however, depend on the specific structure analyzed, mostly on the time dependence of the elements of the model and on intrinsic nonlinearities.


DEFINITION OF COMPARTMENTAL SYSTEMS 


We start by defining a compartment as an amount of material that is kinetically homogeneous and that follows the law of mass balance. The meaning of “mass balance” is elaborated below. A compartmental system, therefore, is a set of compartments that exchange mass with each other and with the external environment. This implies that a compartmental system is an open system with an observer-defined boundary (Anderson 1983; Jacquez and Simon 1993).


Figure 7.1. The mass balance of a single compartment.


Let's consider the mass stored in compartment $i$, denoted by $x_i$, as the balance between (Figure 7.1):

• $u_i \geq 0$: inflow (uptake) from outside the system,
• $r_i \geq 0$: outflow (release) to outside the system,
• $F_{ji} \geq 0$: flow transfers from compartment $i$ to compartment $j$,
• $F_{ij} \geq 0$: flow transfers from compartment $j$ to compartment $i$.


The change in mass over time of this compartment, $\frac{dx_i}{dt} = \dot{x}_i$, must be balanced according to the equation:

$$
\dot{x}_i = \sum_{j \neq i} \left( F_{ij} - F_{ji} \right) + u_i - r_i,
$$

where the constraints $F_{ij} \geq 0$, $u_i \geq 0$, and $r_i \geq 0$ must be met for all $i$, $j$, and $t$. The time dependence is omitted in the notation for simplicity, but all masses and flows may change over time.

An additional constraint for the system is that if the compartment is empty, no mass can flow out of it; i.e., if $x_i = 0$, then $r_i = 0$ and $F_{ji} = 0$ for all $j$, so that $\dot{x}_i \geq 0$.

If the flows $F$ are continuously differentiable, i.e., they change smoothly over time without sudden jumps, we can define the flows as (Jacquez and Simon 1993):

$$
F_{ij} = b_{ij}\,x_j, \qquad r_i = b_{0i}\,x_i.
$$

Therefore, we can write the mass balance equation for compartment $i$ as:

$$
\dot{x}_i = -\left( b_{0i} + \sum_{j \neq i} b_{ji} \right) x_i + \sum_{j \neq i} b_{ij}\,x_j + u_i.
$$

The total outputs from compartment $i$ can be expressed as $b_{ii} = -\left( b_{0i} + \sum_{j \neq i} b_{ji} \right)$; then a general expression for each compartment satisfies the expression:

$$
\dot{x}_i = \sum_{j} b_{ij}\,x_j + u_i.
$$

58 COMPARTMENTAL DYNAMICAL SYSTEMS AND CARBON CYCLE MODELS 


A general expression for the entire system can 
be written as: 


$$
\dot{x} = B\,x + u, \qquad (7.1)
$$

where the elements may be time-dependent and the matrix $B$ and vector $u$ may depend on the vector of
states $x$. Notice that in contrast to other chapters,
we follow here a different notation and use B to 
denote a matrix. The system of Equation 7.1 is a 
compartmental or reservoir system, and the matrix B is 
called the compartmental matrix. 

For any compartmental system, the compartmental matrix $B$ has three properties:

• $b_{ii} \leq 0$ for all $i$, $t > 0$,
• $b_{ij} \geq 0$ for all $i \neq j$, $t > 0$,
• $\sum_{i=1}^{n} b_{ij} = b_{jj} + \sum_{i \neq j} b_{ij} = -z_j \leq 0$ for all $j$, $t > 0$.

In words, the compartmental matrix $B$ must always meet the requirement that all its diagonal entries are non-positive, its off-diagonal entries non-negative, and the sum of all elements inside each column must be non-positive. The negative of this column sum represents the fraction of matter that is released from the system, and it is called the fractional release coefficient $z_j$ because it can be used to compute the amount of material that is released to the external environment from each pool $j$. The total release from the system can be obtained with the expression:

$$
r = z \circ x,
$$

where $z$ is the vector of fractional release coefficients and $\circ$ is the entry-wise product between the two vectors.
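The three sign conditions and the release calculation are simple to check numerically. The following Python sketch does so for a hypothetical two-pool matrix; the function names and the numbers are assumptions for illustration only.

```python
def column_sums(B):
    """Sum of the entries of each column of a square matrix (list of rows)."""
    n = len(B)
    return [sum(B[i][j] for i in range(n)) for j in range(n)]


def is_compartmental(B, tol=1e-12):
    """Check the three sign conditions on B."""
    n = len(B)
    diag_ok = all(B[i][i] <= tol for i in range(n))
    off_ok = all(B[i][j] >= -tol for i in range(n) for j in range(n) if i != j)
    cols_ok = all(s <= tol for s in column_sums(B))
    return diag_ok and off_ok and cols_ok


def release(B, x):
    """Return z (negative column sums of B) and the entry-wise product r = z * x."""
    z = [-s for s in column_sums(B)]
    r = [zj * xj for zj, xj in zip(z, x)]
    return z, r


# Hypothetical system: pool 0 loses 1.0 per unit time, 0.3 of which enters pool 1;
# pool 1 loses 0.2 per unit time, all of it to the outside.
B = [[-1.0, 0.0],
     [0.3, -0.2]]
z, r = release(B, x=[10.0, 15.0])
```

Here `z` is approximately `[0.7, 0.2]` and `r` is approximately `[7.0, 3.0]`: pool 0 releases 70% of its outflow to the environment, and pool 1 releases everything it loses.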

The property $-z \leq 0$ implies that $B$ is a diagonally dominant matrix, which means that the absolute value of each diagonal element is greater than or equal to the sum of the absolute values of the off-diagonal entries in its column. Mathematically, $B$ is diagonally dominant if $|b_{ii}| \geq \sum_{j \neq i} |b_{ji}|$, and strictly diagonally dominant if $|b_{ii}| > \sum_{j \neq i} |b_{ji}|$.

One important property of strictly diagonally dominant matrices is that they are invertible (Taussky 1949); i.e., there exists an inverse matrix $B^{-1}$ such that $B \cdot B^{-1} = I$, where $I$ is the identity matrix. Compartmental systems that meet this property contain no traps (Jacquez and Simon 1993); i.e., all mass that enters the system eventually leaves through one of the output flows.
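To illustrate the link between strict diagonal dominance, invertibility, and traps, the Python sketch below compares two hypothetical two-pool matrices, one in which every pool releases mass to the outside and one in which the second pool is a trap. The helper names are assumptions, and the determinant formula applies to the 2 x 2 case only.

```python
def is_strictly_column_dominant(B):
    """True if |b_jj| > sum of |b_ij| over i != j for every column j."""
    n = len(B)
    return all(
        abs(B[j][j]) > sum(abs(B[i][j]) for i in range(n) if i != j)
        for j in range(n)
    )


def det2(B):
    """Determinant of a 2 x 2 matrix given as a list of rows."""
    return B[0][0] * B[1][1] - B[0][1] * B[1][0]


open_system = [[-1.0, 0.0], [0.3, -0.2]]  # both pools release mass to the outside
with_trap = [[-1.0, 0.0], [1.0, 0.0]]     # pool 1 receives mass but never loses it
```

The first matrix is strictly dominant and has a non-zero determinant, so it can be inverted. The second has a zero determinant: mass accumulates in the trap and no inverse exists. Strict dominance is a sufficient condition only, so a matrix that fails this test may still be invertible.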


CLASSIFICATION OF COMPARTMENTAL 
SYSTEMS 


In the derivation of the compartmental system 
(Equation 7.1), the explicit representation of time 
dependencies and nonlinearities was omitted. We 
will now introduce a classification scheme for com- 
partmental systems based on these two properties, 
time dependencies (autonomy), and interaction 
among state variables (linearity). We call a model 
linear when the vector of inputs and the compart- 
mental matrix are not dependent on the vector of 
states, and nonlinear otherwise. Similarly, we call a 
model autonomous when the mass inputs and the 
compartmental matrix are not explicitly time depen- 
dent, and nonautonomous otherwise (Table 7.1). 

This classification scheme leads to four distinct 
groups of compartmental systems, each with spe- 
cific mathematical properties that we will explore 
in the following sections. 


Autonomous Versus Nonautonomous Systems 


In the autonomous case (Table 7.1), mass inputs and process rates in the system are constant. This implies that the external environment (e.g., solar radiation, air temperature, water content) is assumed constant. Although ecosystems are far from being surrounded by a constant environment, this assumption is sometimes useful to study basic properties of a system such as its long-term behavior.


TABLE 7.1
Classification of carbon cycle models according to their dependence on the vector of states (linearity), and on time (autonomy). Table cells are expressions for the differential equation describing $\dot{x}(t)$ that captures the change of mass contents with respect to time


x-dependence    Autonomous              Nonautonomous
Linear          u + B · x(t)            u(t) + B(t) · x(t)
Nonlinear       u(x) + B(x) · x(t)      u(x,t) + B(x,t) · x(t)


CARLOS A. SIERRA 59

However, it is important not to mix up concepts 
that belong to autonomous systems with concepts 
that do not apply for nonautonomous systems. For 
instance, an autonomous compartmental system 
generally converges to a steady state in the long 
term where the mass of each compartment does 
not change with time. In contrast, a nonautono- 
mous system does not reach such a steady state 
because, by definition, the system is changing all 
the time. Therefore, it is wrong to talk about the 
steady state of a nonautonomous system (for addi- 
tional details see Sierra et al. 2018). 


Linear Versus Nonlinear Systems 


In the linear case, the contents of compartments 
do not influence the rates at which mass flows into 
the system from the external environment, and do 
not influence the rates at which mass flows out 
of the compartments (Table 7.1). In other words, 
there are no feedbacks among compartment contents.
However, nonlinear behavior can occur in ecosystems,
for instance, when the amount of photosynthesis in the
leaves depends on the amount of carbon in nonstructural
carbohydrates or in fine roots.
Nonlinear compartmental systems can show a 
very rich set of qualitative behaviors (Jacquez and 
Simon 1993; Anderson and Roller 1991), which 
for nonlinear autonomous systems range from sus- 
tained oscillations to catastrophic shifts to alternate 
states (Wang et al. 2014). In the nonlinear non- 
autonomous case, the time-dependent signals that 
affect the system introduce an even larger degree 
of complexity, which complicates the behavior of 
these systems further (Müller and Sierra 2017).


PROPERTIES AND LONG-TERM BEHAVIOR OF 
AUTONOMOUS COMPARTMENTAL SYSTEMS 


Even though the assumption of a constant envi- 
ronment is unrealistic, autonomous models can 
be very useful in illustrating potential behavior 
of compartmental systems. In the following, we 
will present a few properties of autonomous sys- 
tems that are useful for many applications, which 
include: long-term behavior of stocks and fluxes, 
behavior in the neighborhood of the steady state 
after a perturbation, the age structure of the 


compartments and the release flux, and the behav- 
ior of an impulsive tracer. 


Linear Systems 
We will consider first linear autonomous compart- 
mental systems of the form 


$$
\dot{x}(t) = u + B \cdot x(t), \qquad (7.2)
$$

with $B$ invertible and some initial conditions at $t = 0$

$$
x(0) = x_0.
$$

One advantage of systems of the form of Equation 7.2 compared to the other systems in Table 7.1 is that it is possible to compute their analytical solution. The general solution of this model is given by:

$$
x(t) = e^{tB} \cdot x_0 + \int_0^t e^{(t-\tau)B} \cdot u \, d\tau, \qquad (7.3)
$$

where $e^{tB}$ is the matrix exponential.

Equation 7.3 shows that the solution of the 
system is composed of two terms. The first term 
accounts for the decomposition of the mass ini- 
tially stored in the system at time zero. The second 
term accounts for decomposition of the inputs 
that entered the system until time t. At any given 
time, the mass stored in the system is the sum of 
both the remaining of the initial mass present at 
time t = 0 and all the un-decomposed mass that 
entered until time t. 
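For a single pool the matrix $B$ reduces to a scalar $-k$, and the two terms described above take a closed form. The Python sketch below evaluates them and checks the result against a crude numerical integration; the parameter values and function names are hypothetical.

```python
import math


def one_pool_solution(t, x0, u, k):
    """Return the two terms of the solution of dx/dt = u - k * x, x(0) = x0."""
    initial_term = math.exp(-k * t) * x0             # what remains of the initial mass
    input_term = (u / k) * (1.0 - math.exp(-k * t))  # what remains of the inputs so far
    return initial_term, input_term


def euler(t_end, x0, u, k, steps=100000):
    """Forward-Euler integration, used only as an independent check."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += dt * (u - k * x)
    return x


initial_term, input_term = one_pool_solution(t=10.0, x0=50.0, u=2.0, k=0.1)
x_analytical = initial_term + input_term
x_numerical = euler(10.0, 50.0, 2.0, 0.1)
```

As $t$ grows, the first term vanishes and the second approaches $u/k$, so the initial condition is gradually forgotten.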

The release of mass from the system is computed by multiplying the fractional release coefficients $z_j$ by the amount of carbon stored in each pool as:

$$
r(t) = z \circ x(t).
$$

If the system runs for a very long time, it eventually reaches a point called the steady state where all inputs are equal to the outputs, and there are no changes in mass within the system. Technically, as $t \to +\infty$, $x(t) \to x^*$, where:

$$
x^* = -B^{-1} \cdot u, \qquad (7.4)
$$

and

$$
r^* = z \circ x^*.
$$

Notice that the steady state does not depend on the initial conditions. It only depends on the compartmental matrix and the vector of external inputs, and represents the equilibrium point where the total amount of matter in the system and in the individual pools do not change, i.e., $\sum_i \dot{x}_i = 0$ and $\dot{x} = 0$, respectively.
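A short Python sketch can confirm these steady-state properties for a hypothetical two-pool system. It assumes that the steady state solves $0 = u + B \cdot x^*$, which follows from setting the time derivative in Equation 7.2 to zero; the solver is a plain Cramer's rule for the 2 x 2 case, and all numbers are illustrative.

```python
def solve2(M, b):
    """Solve the 2 x 2 linear system M y = b with Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        raise ValueError("matrix is singular; no unique steady state")
    y0 = (b[0] * M[1][1] - M[0][1] * b[1]) / det
    y1 = (M[0][0] * b[1] - M[1][0] * b[0]) / det
    return [y0, y1]


def steady_state(B, u):
    """Solve B x* = -u, which is equivalent to x* = -B^{-1} u."""
    return solve2(B, [-ui for ui in u])


B = [[-1.0, 0.0], [0.3, -0.2]]  # hypothetical compartmental matrix
u = [10.0, 0.0]                 # inputs enter pool 0 only
x_star = steady_state(B, u)
net_change = [u[i] + sum(B[i][j] * x_star[j] for j in range(2)) for i in range(2)]
z = [-(B[0][j] + B[1][j]) for j in range(2)]
total_release = sum(zj * xj for zj, xj in zip(z, x_star))
```

At the computed steady state the net change of every pool is zero and the total release equals the total input, as the text describes.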


Nonlinear Systems 


In contrast to linear systems, nonlinear compart- 
mental systems have no general explicit analytical 
solution. However, it is always possible to obtain a 
numerical solution of the system using any suit- 
able numerical method (LeVeque 2007). 

In most applications, we are interested in observ- 
ing how the system evolves over time and eventu- 
ally reaches a steady state. Therefore, it is of interest 
to find an equilibrium solution for the system: 


$$
\dot{x} = u(x) + B(x) \cdot x, \qquad (7.5)
$$

such that:

$$
0 = u(x) + B(x) \cdot x. \qquad (7.6)
$$

However, it is not certain that a specific non- 
linear system has an equilibrium solution, or in 
case there is one, that this equilibrium is unique. 
Anderson and Roller (1991) show special cases of 
nonlinear compartmental systems with constant 
inputs that have unique solutions, but these cases 
are too specific for our purposes here. 

Certain combinations of parameter values and pool sizes may lead to the situation in which the matrix $B(x)$ is not compartmental, and therefore the system may not be mass balanced. For this reason, it is useful to define a space in which a nonlinear system is well defined. Following Anderson and Roller (1991), we define $\mathbb{R}^n_+ := \{x \in \mathbb{R}^n : x \geq 0\}$ as the set of all non-negative real vectors in an $n$-dimensional space. Since the mass in all compartments is always non-negative, the solutions of the system can only occupy this space. Now we define the space within $\mathbb{R}^n_+$ where all solutions of the system obey mass balance constraints as:

$$
\Omega := \{x \in \mathbb{R}^n_+ : B(x) \text{ is a compartmental matrix}\}.
$$

The space $\Omega$ is the set of all possible states the system can take without violating mass balance. One important use of $\Omega$ is that it can be used to test whether a particular nonlinear model does not violate mass balance for any value of $x$ and $t$.

For the case of constant inputs, i.e., a constant vector $u$, Anderson and Roller (1991) propose an iteration strategy to find a steady-state solution for a nonlinear autonomous system. It consists of applying the formula:

$$
x^{p+1} = -B(x^p)^{-1} \cdot u, \qquad p = 0, 1, 2, \ldots,
$$

until $x^{p+1} \approx x^p$. Notice that for this method to work, the compartmental matrix must be invertible. Also, the existence of one equilibrium point is not a guarantee that it is unique: other equilibria may exist as well. The choice of the starting point $x^{p=0}$ may determine which equilibrium point the method will find.
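The iteration can be sketched in Python for a hypothetical two-pool nonlinear model. The state-dependent matrix `B_of_x`, its parameters, and the stopping tolerance are assumptions chosen so that the example converges quickly; they are not taken from Anderson and Roller (1991).

```python
def solve2(M, b):
    """Solve the 2 x 2 linear system M y = b with Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        raise ValueError("B(x) is singular; the iteration cannot continue")
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]


def B_of_x(x, k0=0.5, half_sat=5.0, a=0.4, k2=0.1):
    """Hypothetical compartmental matrix whose pool-0 rate rises with pool 1."""
    k1 = k0 * (1.0 + x[1] / (half_sat + x[1]))
    return [[-k1, 0.0], [a * k1, -k2]]


def fixed_point_steady_state(matrix_fn, u, x_start, tol=1e-10, max_iter=100):
    """Iterate x_{p+1} = -B(x_p)^{-1} u until successive iterates agree."""
    x = list(x_start)
    for _ in range(max_iter):
        x_new = solve2(matrix_fn(x), [-ui for ui in u])
        if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("iteration did not converge")


u = [10.0, 0.0]
x_star = fixed_point_steady_state(B_of_x, u, x_start=[0.0, 0.0])
B_star = B_of_x(x_star)
residual = [u[i] + sum(B_star[i][j] * x_star[j] for j in range(2)) for i in range(2)]
```

The residual of Equation 7.6 at the returned point is close to zero. For models with several equilibria, rerunning the function from different values of `x_start` is a simple way to probe which equilibrium the method reaches.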


Stability Analysis Near Equilibria 


In many applications, it is of interest to study the 
behavior of a system as it approaches an equilib- 
rium point, or the behavior of the system when 
it is slightly perturbed from this equilibrium. The 
study of these behaviors usually falls under the 
label stability analysis. Again, the stability analysis 
would differ depending on whether the autono- 
mous system is linear or nonlinear. 


Linear Systems 


For linear autonomous compartmental systems 
(Equation 7.2), their long-term behavior can be 
studied by analyzing the eigenvalues and eigen- 
vectors of the compartmental matrix B. It is well 
established that a compartmental matrix with con- 
stant coefficients has no eigenvalues with posi- 
tive real part, which means that the mass inside 
the compartments never grows exponentially as 
long as inputs and rates are kept constant. This is 
ensured by the diagonally dominant property of
the compartmental matrix. 

In most applications, the eigenvalues of the 
linear autonomous compartmental matrix have 
a negative real part. In these cases, it is said that 
the compartmental system is asymptotically stable 
because all solutions converge in the long-term to 
the steady state of Equation 7.4. If the eigenvalues
also contain an imaginary part, then the solution will
approach the steady state through oscillations. If
the eigenvalues contain no imaginary part, then the
system approaches the steady state in the direction 
given by the eigenvector of the eigenvalue with 
the smallest absolute value of the real part. 
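For a two-pool system the eigenvalues follow directly from the trace and determinant, which gives a compact way to illustrate these cases. The Python sketch below is restricted to 2 x 2 matrices; the function names, the labels, and the example matrices are hypothetical.

```python
import cmath


def eigenvalues2(B):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    trace = B[0][0] + B[1][1]
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def classify(B, tol=1e-12):
    """Label the long-term behaviour implied by the largest real part."""
    largest = max(lam.real for lam in eigenvalues2(B))
    if largest > tol:
        return "unstable"
    if largest >= -tol:
        return "contains a trap"
    return "asymptotically stable"


series = [[-1.0, 0.0], [0.3, -0.2]]  # both pools lose mass to the outside
trap = [[-1.0, 0.0], [1.0, 0.0]]     # pool 1 never loses mass
series_eigs = sorted(lam.real for lam in eigenvalues2(series))
```

For two pools the discriminant equals $(b_{11} - b_{22})^2 + 4 b_{12} b_{21}$, which cannot be negative when the off-diagonal entries are non-negative, so the eigenvalues of a two-pool compartmental matrix are always real. An oscillatory approach to the steady state therefore requires at least three pools.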

A third possibility is that the compartmental 
matrix contains at least m eigenvalues with a real 
part equal to zero. In this case, it is said that the 
compartmental system contains m traps (Jacquez 
and Simon 1993). A trap is a compartment, or a 
set of connected compartments, where mass may 
flow in but cannot flow out. In this case, the system
contains no equilibrium since B is not invertible
and Equation 7.4 cannot be solved. The system,
therefore, will grow proportionally to the amount
of mass entering the m traps.


Nonlinear Systems 


For nonlinear systems, it is common to study the 
behavior of the system in the neighborhood of 
one or multiple equilibrium points. For compartmental
systems, we are only interested in equilibria
that reside in the space $\Omega$, since they are
the only ones that have a physical and biological
interpretation.

If we assume that the nonlinear autonomous system of Equation 7.5 has at least one equilibrium point in $\Omega$, then we are interested in calculating the Jacobian matrix, defined as:

$$
J(x) = \frac{\partial}{\partial x} \left( u(x) + B(x) \cdot x \right),
$$

at an equilibrium point $x = x^* \in \Omega$. This Jacobian matrix tells us about the behavior of trajectories that are close to the steady state, which is a point in the phase plane. Then, the properties of the Jacobian matrix, particularly its eigenvalues, tell us about the stability of the system in the neighborhood of the equilibrium (Guckenheimer and Holmes 1983). It is possible to treat the


nonlinear system as a linear system in the neigh- 
borhood of the equilibrium, and for this reason 
one can perform the same analysis of eigenvalues 
as in the linear case (Guckenheimer and Holmes 
1983). 

If there are eigenvalues with positive real part, 
trajectories are repelled away from the equilibrium 
point, which is considered unstable (Strogatz 1994). 
The existence of unstable equilibria is an indication 
of possible tipping points and alternative states for 
the system (Scheffer et al. 2001). However, it is 
often the case that the Jacobian matrix of a com- 
partmental system is also a compartmental matrix, 
in which case the existence of unstable equilibria 
is excluded. 
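When an analytical Jacobian is inconvenient, a finite-difference approximation is often enough to inspect local stability. The Python sketch below does this for a hypothetical two-pool nonlinear model; the model, its parameters, and the step size are assumptions for illustration. It relies on the fact that, for a 2 x 2 matrix, both eigenvalues have negative real part exactly when the trace is negative and the determinant is positive.

```python
def rhs(x, u=(10.0, 0.0), k0=0.5, half_sat=5.0, a=0.4, k2=0.1):
    """Right-hand side u + B(x) x of a hypothetical two-pool nonlinear model."""
    k1 = k0 * (1.0 + x[1] / (half_sat + x[1]))  # pool-0 rate rises with pool 1
    return [u[0] - k1 * x[0], u[1] + a * k1 * x[0] - k2 * x[1]]


def jacobian(f, x, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of f at x."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return J


# Equilibrium of this particular model, obtained by solving rhs(x) = 0 by hand.
x_star = [10.0 / (0.5 * (1.0 + 40.0 / 45.0)), 40.0]
J = jacobian(rhs, x_star)
trace = J[0][0] + J[1][1]
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
locally_stable = trace < 0 and det > 0
```

In this example the equilibrium is locally stable even though the entry `J[0][1]` is negative, so the Jacobian is not itself a compartmental matrix. The compartmental structure of the Jacobian is thus a convenient sufficient condition for excluding unstable equilibria, not a necessary one.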

When this Jacobian matrix has a compartmen- 
tal structure, the system is said to be cooperative, 
which means that if the mass of one compart- 
ment increases, the fluxes to other compartments 
also increase (Jacquez and Simon 1993). In this 
case, trajectories close to the equilibrium point 
are attracted to it, and in some particular cases this 
equilibrium may be unique (Jacquez and Simon 
1993; Bastin and Guffens 2006). This particular 
case of a unique equilibrium point means that the
system is globally asymptotically stable or GAS (Müller
and Sierra 2017).


PROPERTIES AND LONG-TERM BEHAVIOR OF 
NONAUTONOMOUS SYSTEMS 


Nonautonomous compartmental systems behave 
in a completely different way to autonomous sys- 
tems. Since the mass inputs and the rates change 
with time, it is not possible for them to converge 
to a fixed point in the state space. Also, the stabil- 
ity analysis tools for autonomous systems are of 
little use for nonautonomous systems. Methods 
to analyze nonautonomous systems are relatively 
new, and they are currently an active branch of 
mathematical research (Rasmussen 2007; Kloeden 
and Rasmussen 2011). Concepts from control 
engineering can also be very useful to study non- 
autonomous systems, particularly nonlinear ones 
(Sontag 1998). Again, we will split the concepts 
for linear versus nonlinear nonautonomous sys- 
tems in the sections below. 


Linear Systems 


We will consider two cases for linear nonautonomous compartmental systems: (1) the case of time-dependent inputs and constant rates, and (2) the case of time-dependent inputs and rates. The first case is given by a system of the form:


x(t)=u(t)+B-x(t), 


with initial condition x(0) = x . If the vector- 
valued function u(t) is known, an analytical solu- 
tion can be obtained as: 


t 


x(t) =e "x, + fae -u(7)dr, 


0 


which is a general form for the linear autonomous 
solution of Equation 7.3.This analytical solution is 
only possible to compute because the rates in the 
compartmental matrix B are constant for all times, 
and therefore one can take advantage of the ana- 
lytical properties of the matrix exponential. 
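The analytical solution above can be evaluated numerically. The following is a minimal sketch, assuming a hypothetical two-pool matrix B that is diagonalizable (so that the matrix exponential can be built from its eigendecomposition) and a hypothetical seasonal input u(t); the convolution integral is approximated with the trapezoid rule. None of the names or values below come from the chapter.

```python
import numpy as np

def matrix_exp_factory(B):
    """Return s -> e^{sB}, built from the eigendecomposition of B.

    Assumption: B is diagonalizable (true for the example below).
    """
    w, V = np.linalg.eig(B)
    V_inv = np.linalg.inv(V)
    return lambda s: (V @ np.diag(np.exp(w * s)) @ V_inv).real

def linear_solution(B, u, x0, t, n=2001):
    """x(t) = e^{tB} x0 + int_0^t e^{(t - tau)B} u(tau) dtau (trapezoid rule)."""
    exp_B = matrix_exp_factory(B)
    taus = np.linspace(0.0, t, n)
    vals = np.array([exp_B(t - tau) @ u(tau) for tau in taus])
    h = taus[1] - taus[0]
    integral = h * (0.5 * vals[0] + vals[1:-1].sum(axis=0) + 0.5 * vals[-1])
    return exp_B(t) @ x0 + integral

# Hypothetical two-pool system: pool 1 feeds pool 2, both release carbon.
B = np.array([[-0.5, 0.0],
              [0.3, -0.1]])
x0 = np.array([2.0, 5.0])
u_seasonal = lambda t: np.array([1.0 + 0.5 * np.sin(2 * np.pi * t), 0.0])

x_10 = linear_solution(B, u_seasonal, x0, t=10.0)
print(x_10)
```

For a constant input the same function can be checked against the closed form x(t) = x* + e^{tB}(x_0 − x*), with x* = −B⁻¹u, which is a convenient sanity test before using time-dependent inputs.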

For the second case, when both mass inputs and
rates are time dependent, the system is expressed as:


$$\dot{x}(t) = u(t) + B(t) \cdot x(t), \tag{7.7}$$


for which an analytical solution cannot be computed.
However, a semi-explicit solution for
Equation 7.7 can be expressed in terms of the state
transition operator Φ(t, t_0), which is a matrix whose
product with the state vector at an initial time t_0
gives x(t) at a later time t. In other words, Φ(t, t_0) · x_0
is the solution to the homogeneous equation
ẋ = B(t) · x.

The semi-explicit solution of the linear nonautonomous
system of Equation 7.7 can be expressed
as:


$$x(t, t_0, x_0) = \Phi(t, t_0) \cdot x_0 + \int_{t_0}^{t} \Phi(t, \tau) \cdot u(\tau)\, d\tau. \tag{7.8}$$


This solution explicitly depends on the initial 
conditions since for a nonautonomous system, 
where mass inputs and rates constantly change 
with time, the exact time and state when the sys- 
tem starts is of fundamental importance to com- 
pute a unique solution. In the autonomous case, 
solutions only depend on the time elapsed t − t_0,
while in the nonautonomous case the solutions
depend separately on the actual time t and the
starting time t_0 (Kloeden and Rasmussen 2011).
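The dependence on both t and t_0 can be made concrete by computing the state transition operator numerically. The sketch below is an illustration under stated assumptions: it integrates dΦ/dt = B(t) · Φ with Φ(t_0, t_0) = I using a classical fourth-order Runge-Kutta scheme, for a hypothetical two-pool matrix whose rates are scaled by a hypothetical seasonal factor xi(t). The function and variable names are illustrative only.

```python
import numpy as np

def transition_matrix(B_of_t, t0, t, steps=2000):
    """Integrate dPhi/dt = B(t) Phi from Phi(t0, t0) = I with classical RK4."""
    n = B_of_t(t0).shape[0]
    Phi = np.eye(n)
    h = (t - t0) / steps
    s = t0
    for _ in range(steps):
        k1 = B_of_t(s) @ Phi
        k2 = B_of_t(s + h / 2) @ (Phi + h / 2 * k1)
        k3 = B_of_t(s + h / 2) @ (Phi + h / 2 * k2)
        k4 = B_of_t(s + h) @ (Phi + h * k3)
        Phi = Phi + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        s += h
    return Phi

# Hypothetical two-pool system whose rates follow a seasonal scalar xi(t).
B0 = np.array([[-0.5, 0.0],
               [0.3, -0.1]])
xi = lambda t: 1.0 + 0.5 * np.sin(2 * np.pi * t)
B_seasonal = lambda t: xi(t) * B0

# Same elapsed time (a quarter of a year), different starting times.
Phi_a = transition_matrix(B_seasonal, 0.0, 0.25)
Phi_b = transition_matrix(B_seasonal, 0.5, 0.75)
print(Phi_a)
print(Phi_b)

# With constant rates only the elapsed time matters.
B_const = lambda t: B0
Phi_c = transition_matrix(B_const, 0.0, 0.25)
Phi_d = transition_matrix(B_const, 0.5, 0.75)
print(np.allclose(Phi_c, Phi_d))
```

In this example Phi_a and Phi_b differ although the elapsed time is the same, whereas Phi_c and Phi_d coincide, which mirrors the distinction drawn in the text.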


Rasmussen et al. (2016) present a sufficient
condition for the global exponential stability of
the nonautonomous linear compartmental system.
If the compartmental matrix B(t) of the homogeneous
system ẋ = B(t) · x is strictly diagonally
dominant for all t, then this system is exponentially
stable. This means that there is a minimal
rate at which the initial mass in the system decays.
Now, for the inhomogeneous case (Equation
7.7), we can think of two solutions s_1(t, t_0, x_1) and
s_2(t, t_0, x_2) that have different initial conditions. As a
consequence of the exponential stability property,
the two solutions are said to be forward attracting, i.e.
they get close to each other as t → +∞.
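A rough numerical check of this condition might look as follows. The sketch assumes that diagonal dominance is understood column-wise, which for a compartmental matrix means that every pool releases some mass out of the system (release rate r_j = −Σ_i b_ij > 0); this reading, the example matrix, and the function names are assumptions made for illustration. Under that assumption the total mass of the homogeneous system should decay at least as fast as e^{−r(t − t_0)}, where r is the smallest release rate.

```python
import numpy as np

def min_release_rate(B_of_t, times):
    """Smallest column-wise release rate -sum_i B[i, j](t) over the sampled times."""
    return min(float(np.min(-B_of_t(t).sum(axis=0))) for t in times)

def propagate(B_of_t, x0, t0, t_end, steps=4000):
    """RK4 integration of the homogeneous system dx/dt = B(t) x."""
    x = np.array(x0, dtype=float)
    h = (t_end - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = B_of_t(t) @ x
        k2 = B_of_t(t + h / 2) @ (x + h / 2 * k1)
        k3 = B_of_t(t + h / 2) @ (x + h / 2 * k2)
        k4 = B_of_t(t + h) @ (x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

# Hypothetical seasonal two-pool matrix (rates per year).
B0 = np.array([[-0.5, 0.0],
               [0.3, -0.1]])
B_seasonal = lambda t: (1.0 + 0.5 * np.sin(2 * np.pi * t)) * B0

rate = min_release_rate(B_seasonal, np.linspace(0.0, 1.0, 1001))
x0 = np.array([2.0, 5.0])
x_end = propagate(B_seasonal, x0, 0.0, 20.0)
bound = np.exp(-rate * 20.0) * x0.sum()
print(rate, x_end.sum(), bound)
```

Because the difference between two solutions of the inhomogeneous system obeys the homogeneous equation, the same decay bound applies to the distance between them, which is the forward attraction described above.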

Rasmussen et al. (2016) also showed that for 
linear nonautonomous compartmental systems 
that meet the sufficient condition for exponen- 
tial stability, there exists a unique pullback attracting 
solution or pullback attractor which all solutions are 
attracted to. It is defined as: 


$$v(t) := \int_{-\infty}^{t} \Phi(t, \tau) \cdot u(\tau)\, d\tau,$$


and can be interpreted as the solution that has no 
influence whatsoever from the initial conditions 
(Kloeden and Rasmussen 2011). Therefore, the 
pullback attractor is the nonautonomous equiva- 
lent of the steady-state concept for autonomous 
systems (Carvalho et al. 2013). 

A particular case is the linear nonautonomous 
system in which the mass inputs and the process 
rates are periodic. For example, this is the case of 
seasonal systems without noise in which the same 
periodic pattern for the mass inputs and for the 
process rates is repeated every year. More precisely, 
a periodic linear compartmental system is one in 
which u(t + T) = u(t) and B(t + T) = B(t) for a 
fixed period T and for all t. Mulholland and Keener 
(1974) showed that these types of systems have 
periodic solutions for which x(t + T) = x(t). This
periodic solution can be interpreted as a pullback
attractor because it is not influenced by the initial
conditions.
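The idea of pulling the starting time back into the past can be sketched with a single seasonal pool. The example below is an assumption-laden illustration: one hypothetical pool with a periodic input u(t) and a periodic rate k(t), both with period T = 1 year, integrated with a fourth-order Runge-Kutta scheme. Solutions started from very different initial contents are evaluated at the same time t = 0 for increasingly early starting times.

```python
import numpy as np

def simulate(x0, t0, t_end, h=0.01):
    """RK4 integration of dx/dt = u(t) - k(t) x for one hypothetical seasonal pool."""
    u = lambda t: 1.0 + 0.5 * np.sin(2 * np.pi * t)          # periodic input, T = 1
    k = lambda t: 0.5 * (1.0 + 0.3 * np.cos(2 * np.pi * t))  # periodic rate, T = 1
    f = lambda t, x: u(t) - k(t) * x
    steps = int(round((t_end - t0) / h))
    x, t = float(x0), float(t0)
    for _ in range(steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, x + h / 2 * k1)
        k3 = f(t + h / 2, x + h / 2 * k2)
        k4 = f(t + h, x + h * k3)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

# Pull the starting time back while always evaluating at t = 0.
for t0 in (-5.0, -20.0, -40.0):
    print(t0, simulate(0.0, t0, 0.0), simulate(10.0, t0, 0.0))

x_empty = simulate(0.0, -40.0, 0.0)      # started with an empty pool
x_full = simulate(10.0, -40.0, 0.0)      # started with 10 units of carbon
x_next_year = simulate(0.0, -40.0, 1.0)  # same trajectory one period later
print(x_empty, x_full, x_next_year)
```

As the starting time moves further into the past, the two columns printed by the loop approach a common value, and that value repeats one period later, which is the behavior expected of a periodic pullback attractor.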


Nonlinear Systems 


Nonlinear nonautonomous compartmental systems
are the most complex cases to study and
analyze. It is not possible in general to obtain analytical
solutions, and, contrary to the autonomous
case, it is not possible to study an equilibrium
point for these systems because, by definition,
compartment contents are always changing and
they never reach a constant value.

As mass inputs and process rates change in a
nonlinear nonautonomous compartmental system,
it is possible that specific combinations of
parameter values and compartment sizes lead the
system outside the space Ω where mass balance
considerations must be met. Therefore, it is always
important to check that solutions for these systems
are always inside this space; i.e. x(t, t_0, x_0) ∈ Ω for
all t, where x(t, t_0, x_0) is a solution trajectory of the
nonlinear nonautonomous compartmental system
of the form:


$$\dot{x}(t) = u(x(t), t) + B(x(t), t) \cdot x(t). \tag{7.9}$$


Concepts from control theory could be used to
ensure that solutions are well behaved and inside
Ω, and more importantly, within certain ‘regions
of stability’ that solutions are attracted to (Müller
and Sierra 2017; Kloeden and Rasmussen 2011).

Input-to-state stability (ISS) is a concept from
the field of control theory that can be used to
determine whether a nonlinear nonautonomous
compartmental system meets stability properties.
We say that a dynamical system is ISS if it is globally
asymptotically stable in the absence of time-dependent
perturbations, and if its trajectories are
bounded by a function of the size of the input for
all sufficiently large times (Sontag 1998; Müller
and Sierra 2017). Therefore, we can expect the
trajectories of an ISS system to remain within a certain
region as long as the initial mass decays over
time, and the mass inputs stay bounded within a
certain limit.

We expect that for most applications, nonlinear
nonautonomous compartmental systems meet the
properties of ISS systems. However, mathematically
showing that a system is ISS is not trivial, and this
should be studied on a case-by-case basis (Sierra
and Müller 2015; Müller and Sierra 2017).
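Checking that a trajectory stays inside Ω and remains bounded can be done numerically, although such a check is not a proof of ISS. The sketch below assumes that Ω is the set of nonnegative pool contents and uses a hypothetical substrate-microbe model written in the form of Equation 7.9; the model structure, parameter values, and function names are all illustrative assumptions rather than anything specified in this chapter.

```python
import numpy as np

def rhs(x, t):
    """dx/dt = u(x, t) + B(x, t) x for a hypothetical substrate-microbe model."""
    substrate, microbes = x
    u = np.array([1.0 + 0.5 * np.sin(2 * np.pi * t), 0.0])  # bounded seasonal input
    v_max = 2.0 * (1.0 + 0.3 * np.cos(2 * np.pi * t))       # seasonal uptake capacity
    uptake_rate = v_max * microbes / (10.0 + substrate)     # per unit of substrate
    eff, death = 0.4, 0.2                                   # carbon-use efficiency, mortality
    # Columns sum to <= 0: part of the uptake is respired, dead microbes return to substrate.
    B = np.array([[-uptake_rate, death],
                  [eff * uptake_rate, -death]])
    return u + B @ x

def simulate(x0, t0, t_end, h=0.01):
    """RK4 integration; returns the whole trajectory as an array (steps + 1, pools)."""
    steps = int(round((t_end - t0) / h))
    x, t = np.array(x0, dtype=float), float(t0)
    path = [x.copy()]
    for _ in range(steps):
        k1 = rhs(x, t)
        k2 = rhs(x + h / 2 * k1, t + h / 2)
        k3 = rhs(x + h / 2 * k2, t + h / 2)
        k4 = rhs(x + h * k3, t + h)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
        path.append(x.copy())
    return np.array(path)

path = simulate([5.0, 1.0], 0.0, 50.0)
inside_omega = bool(np.all(path >= 0.0))  # assumption: Omega is the nonnegative orthant
print(inside_omega, path.min(axis=0), path.max(axis=0))
```

A trajectory that leaves the nonnegative orthant or grows without bound under bounded inputs would be a warning sign that the model structure or parameter values need a closer look.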


FINAL REMARKS 


The theory of compartmental dynamical systems 
offers a formal theoretical framework to express and 
analyze models of the carbon cycle and other biogeo- 
chemical elements that meet mass balance require- 
ments. Using a matrix representation of carbon 
storage in ecosystem pools, it is possible to use the 
theory of compartmental dynamical systems to study 
important characteristics of models such as their 
long-term behavior, the presence of traps that retain 
carbon indefinitely in a model, and the response of 
ecosystem compartments to disturbances. 

The representation of ecosystem models as 
compartmental systems is also useful to study sys- 
tem level properties of ecosystems (see Chapter 
15). It is a useful mathematical representation that 
can relate ecosystem concepts to formal math- 
ematical properties of dynamical systems. 


SUGGESTED READING 


General introductions to compartmental systems 
can be found in the monograph by Anderson 
(1983), and the comprehensive review of Jacquez 
and Simon (1993). More specific results about the 
application of compartmental systems to model 
the terrestrial carbon cycle can be found in the ref- 
erence list and other chapters of this book. 


QUIZZES 


1. According to the general classification of models 
with respect to their dynamical properties, what 
type of compartmental systems have a fixed- 
point steady-state? 


2. Can linear compartmental systems show transi- 
tions through tipping points? 


3. What is the analogue of a steady state for nonau- 
tonomous systems? Why? 




CHAPTER EIGHT 


Practice 2 


MATRIX REPRESENTATION OF CARBON 
BALANCE EQUATIONS AND CODING 


Yuanyuan Huang


Climate Science Centre, CSIRO, Canberra, Australia 


CONTENT 


Introduction


This practice helps you to learn how to write a 
matrix equation from carbon balance equations 
for the CENTURY model. You will also learn how 
to numerically solve the matrix equation through 
Python code using the CarboTrain package. 


INTRODUCTION 


In Chapter 5 we saw that a key step to develop
the matrix approach to land carbon cycle modeling
is to represent a land carbon cycle model in
a matrix form. This can be achieved by organizing
the carbon balance equations of a model into
one matrix equation. For many common models,
the matrix version is mathematically identical to
the original model without any changes in represented
processes. Conversely, a matrix equation can
be expanded row by row to give carbon balance
equations for individual carbon pools. Exercise 1
focuses on how to derive the matrix version from
the carbon balance equations of a carbon cycle
model with multiple pools. Exercise 2 provides
practice in coding and running the matrix model
using Python through the CarboTrain package.
Through coding, you will find some benefits of
working with the matrix model, such as being
easy to code and run model simulations.


EXERCISE 1: Deriving the matrix model from carbon balance equations


In this exercise we will develop a matrix form
of the CENTURY model from the carbon balance
equations. If you performed Exercise 1
in Chapter 4, you have written carbon balance
equations of the CENTURY model, whose carbon
flow diagram is shown in Figure 4.1.

The carbon cycle as depicted in the carbon
flow diagram in Figure 4.1 can be represented
by five carbon balance equations as below:


$$
\begin{aligned}
\frac{dx_1}{dt} &= I f_1 - k_1 x_1 \\
\frac{dx_2}{dt} &= I f_2 - k_2 x_2 \\
\frac{dx_3}{dt} &= f_{31} k_1 x_1 + f_{32} k_2 x_2 + f_{34} k_4 x_4 + f_{35} k_5 x_5 - k_3 x_3 \\
\frac{dx_4}{dt} &= f_{41} k_1 x_1 + f_{43} k_3 x_3 - k_4 x_4 \\
\frac{dx_5}{dt} &= f_{53} k_3 x_3 + f_{54} k_4 x_4 - k_5 x_5
\end{aligned}
$$


where x1, x2, x3, x4, and x5 correspond to
structural carbon, metabolic carbon, active soil
carbon, slow soil carbon, and passive soil carbon
pools. I is the input rate from plant residue.
f1 and f2 indicate the proportion of this
carbon input that is allocated to structural and
metabolic litter. fij are fractions of carbon transferred
from pool j to pool i. k1 to k5 are rates of
soil organic carbon decomposition.

To develop a matrix model from the carbon
balance equations, we need to know the following
items. The first item is the general form
of the matrix equation. If you do not remember it,
please go back to Chapter 5 to find the
general form of the matrix equation. Second,
you need to remember what the scalar, vectors,
and matrices are in the general matrix equation.
For example, is dX/dt a scalar, vector or matrix?
What does it mean? Please list the scalar, vectors
and matrices in the general matrix equation.
Third, you need to write all the scalar elements
of these vectors and matrices. For example,
X = [x1 x2 x3 x4 x5]^T indicates a vector of pools
related to structural carbon, metabolic carbon,
active soil carbon, slow soil carbon and passive
soil carbon. What are the other vectors and
matrices of the matrix model?


EXERCISE 2: Coding and running the matrix model


This exercise uses the package CarboTrain,
which enables you to do exercises in this
and other units with minimal background
training in modeling and programming.
Instructions for installing and working with
CarboTrain are available in Appendix 3 of
this book. The following steps will guide
you through the coding and running of the
matrix version of the TECO model using the
CarboTrain package.


Step 1: Run the default source code.


In the main window of CarboTrain, select
Unit 2 and Exercise 2, specify the path of the
output directory in which you wish to store
your model result, click on Run Exercise.
The notification “Task submitted!” will pop
up. Click on OK. Wait until the notification
“Finished” appears, then click on OK. Go to
the output directory you specified. There are
two files. One is result.png, which plots the
change of seven carbon pools through time.
A second file is output.csv, which saves the
value of each carbon pool in each simulation
year.


Step 2: Understand the default source code. 


In the CarboTrain main window, click on 
Edit source code. You will see the following 
source code (black text in your GUI). The 
code is based on the matrix model of TECO. 
The first part of the code (Figure 8.1) loads packages and
environment, and reads the output path you
specified in the previous step.


The second part of the code (Figure 8.2)
constructs the matrices and specifies
our desired simulation length in years. The
vector iv_list is the initial carbon pool size,
input_fluxes specifies the input rate, B is the
allocation vector, A is the transfer matrix,
and K is the turnover rate matrix.


The third part of the code (Figure 8.3) calls 
the function GeneralModel to solve the carbon 
balance equations and generate the result. 
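The internals of GeneralModel are not shown in this chapter. To make the role of each argument concrete, the following is a minimal stand-in, assuming that the model being solved is dX/dt = B · input + A K X and that a backward-Euler time step is acceptable; the function `solve_matrix_model` and the three-pool example are hypothetical and are not part of CarboTrain.

```python
import numpy as np

def solve_matrix_model(times, B, A, K, iv_list, input_flux):
    """Backward-Euler solution of dX/dt = B * input_flux + A K X.

    Returns an array of shape (number of pools, number of times).
    This is an illustrative stand-in, not the CarboTrain implementation.
    """
    n = len(iv_list)
    M = A @ K
    b = np.asarray(B, dtype=float).reshape(n) * input_flux
    X = np.zeros((n, len(times)))
    X[:, 0] = iv_list
    for i in range(1, len(times)):
        h = times[i] - times[i - 1]
        # Implicit step: (I - h M) X_new = X_old + h b
        X[:, i] = np.linalg.solve(np.eye(n) - h * M, X[:, i - 1] + h * b)
    return X

# Hypothetical three-pool chain (litter -> fast soil -> slow soil), rates per year.
B = np.array([1.0, 0.0, 0.0]).reshape([3, 1])
A = np.array([[-1.0, 0.0, 0.0],
              [0.5, -1.0, 0.0],
              [0.0, 0.2, -1.0]])
K = np.diag([1.0, 0.1, 0.01])
times = np.linspace(0, 2000, num=2001)

res = solve_matrix_model(times, B, A, K, [0, 0, 0], input_flux=100.0)
steady_state = -np.linalg.solve(A @ K, B.reshape(3) * 100.0)
print(res[:, -1])
print(steady_state)
```

The implicit step is chosen here because pool turnover rates can differ by several orders of magnitude, which makes explicit stepping with a yearly time step unstable; other solvers would work equally well for this illustration.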


from GeneralModel import GeneralModel
import numpy as np
import matplotlib.pyplot as plt
import sys

if __name__ == '__main__':
    output_folder = sys.argv[1]

Figure 8.1. First part of the source code of the matrix version of TECO.


    B = np.array([0.45, 0.55, 0, 0, 0, 0, 0]).reshape([7,1])   # allocation

    f31 = 0.72; f41 = 0.28; f42 = 1; f53 = 0.45; f54 = 0.275; f64 = 0.275;
    f65 = 0.296; f75 = 0.004; f56 = 0.42; f76 = 0.03; f57 = 0.45;
    A = np.array([-1, 0, 0, 0, 0, 0, 0,
                  0, -1, 0, 0, 0, 0, 0,
                  f31, 0, -1, 0, 0, 0, 0,
                  f41, f42, 0, -1, 0, 0, 0,
                  0, 0, f53, f54, -1, f56, f57,
                  0, 0, 0, f64, f65, -1, 0,
                  0, 0, 0, 0, f75, f76, -1]).reshape([7,7])   # transfer

    # turnover rate per day of pools: foliage, wood, metabolic litter, structural
    # litter, soil microbial, slow soil, passive soil
    temp = [0.00176, 0.000100104, 0.021468, 0.000845, 0.008534, 8.976e-005, 0.00000154782]
    K = np.zeros(49).reshape([7, 7])
    for i in range(0, 7):
        K[i][i] = temp[i]

    # Unit of turnover rate from day^-1 to second^-1
    # 1 day = 86400 seconds
    K = np.multiply(K, 1/86400)

    # Cinput_const
    input_fluxes = 0.00002245

    nyear = 10000   # number of simulation years
    times = np.linspace(0, nyear*365*86400, num = nyear)
    iv_list = [0,0,0,0,0,0,0]   # Initial carbon pool size

Figure 8.2. Second part of the source code of TECO.


    # You specify values of Initial carbon pool size (iv_list), Input rate (input_fluxes),
    # Allocation (B), Transfer (A), Turnover rate (K), and simulation length.
    # GeneralModel solves the carbon balance equations and generates the result
    mod = GeneralModel(times, B, A, K, iv_list, input_fluxes)
    res = mod.get_x()

Figure 8.3. Call to the functions that solve the carbon balance equations and retrieve the result.


The fourth part of the code (Figure 8.4)
generates the output figure (result.png)
and saves results into the csv text file
(output.csv).


    # Plot carbon pools and save results into csv file
    fig = plt.figure(6*2, figsize=(14, 7.68))
    plt.subplots_adjust(left = 0.1, right = 0.95, bottom = 0.10, top = 0.9, wspace =0.3, hspace =0.4)
    x = list(range(1,nyear + 1, 1))
    pool_names = ["foliage", "wood", "metabolic litter", "structural litter", "soil microbial", "slow soil", "passive soil"]
    for i in range(1, 4):
        for j in range(1, 4):
            if ((i-1) * 3 + j) > 7:
                break
            ax = plt.subplot(3, 3, (i-1) * 3 + j)
            ax.plot(x, res[(i-1) * 3 + j - 1,:])
            plt.xlabel("year", fontsize = 12)
            plt.ylabel(pool_names[(i-1) * 3 + (j-1)] + " pool ($g C m^{-2}$)", fontsize = 12)
    plt.savefig(output_folder + "/result" + ".png", dpi = 500)
    #plt.show()
    print(res[:,nyear-1]) # print result of the last year
    mod.write_output(output_folder + "/output.csv")

Figure 8.4. Code segment that generates visual output and saves results to a text file.


Step 3: Modifying the default source code.


To get a sense of how the code works and
what controls the system dynamics, you
could change some of the default values in
the second part of the code. For example, if
we change the number of simulation years
(nyear) to 100, how does this affect the
output from the run? If we make the passive
soil carbon turnover faster by changing
the default value of 0.00000154782 per
day to 10⁻⁵ (1e-5 in Python) per day, what
difference do you see in comparison with
the default? You could explore further to
understand how different parts of the matrices
control the system dynamics. Every time
you modify the code through Edit source code,
please make sure you save the code before
you click on Run Exercise. It is good practice
to change one place at a time and change
the value back to the default value after you
finish the practice.


Step 4: Building a new carbon model (optional).


Suppose we have a system with leaf (pool
1), root (pool 2), wood (pool 3), metabolic
litter (pool 4), structural litter (pool 5), fast
soil organic matter (pool 6), and passive soil
organic matter (pool 7). Carbon dynamics
of the system are given by the matrix
equation below. Based on the default TECO
model source code, above, are you able to
code this matrix model and check on carbon
dynamics through time?


$$\frac{dX}{dt} = BI + AKX$$

$$
B = \begin{bmatrix} 0.4 \\ 0.35 \\ 0.25 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
A = \begin{bmatrix}
-1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 \\
f_{41} & f_{42} & f_{43} & -1 & 0 & 0 & 0 \\
f_{51} & f_{52} & f_{53} & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & f_{64} & f_{65} & -1 & 0 \\
0 & 0 & 0 & 0 & f_{75} & 0 & -1
\end{bmatrix}, \quad
K = \begin{bmatrix}
k_1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & k_2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & k_3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & k_4 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & k_5 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & k_6 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & k_7
\end{bmatrix}
$$


We assume 40% of the net photosynthetic carbon
is allocated to leaf, 35% to root and the remaining
to wood. f41 = 0.7; f51 = 0.25; f42 = 0.65; f52
= 0.25; f43 = 0.15; f53 = 0.75; f64 = 0.3; f65
= 0.05; f75 = 0.04; k1 to k7 are [0.0017, 0.002,
0.0001, 0.01, 0.001, 0.0001, 0.000001] per day.
For other parameters, if not specified above (e.g.,
the input rate), we assume they take the same values
as the default TECO model. Thus, you do not
need to change these values.

Does your new matrix model work? Can you
run it? What results do you get?

Click on Open solutions in CarboTrain to check if
the source code of your new matrix model is similar
to the code shown in Figure 8.5.


from GeneralModel import GeneralModel
import numpy as np
import matplotlib.pyplot as plt
import sys

if __name__ == '__main__':
    output_folder = sys.argv[1]

    B = np.array([0.4, 0.35, 0.25, 0, 0, 0, 0]).reshape([7,1])   # allocation

    f41 = 0.7; f51 = 0.25; f42 = 0.65; f52 = 0.25; f43 = 0.15; f53 = 0.75;
    f64 = 0.3; f65 = 0.05; f75 = 0.04;
    A = np.array([-1, 0, 0, 0, 0, 0, 0,
                  0, -1, 0, 0, 0, 0, 0,
                  0, 0, -1, 0, 0, 0, 0,
                  f41, f42, f43, -1, 0, 0, 0,
                  f51, f52, f53, 0, -1, 0, 0,
                  0, 0, 0, f64, f65, -1, 0,
                  0, 0, 0, 0, f75, 0, -1]).reshape([7,7])   # transfer

    # turnover rate per day of pools:
    # leaf, root, wood, metabolic litter, structural litter, fast SOM, passive SOM
    temp = [0.0017, 0.002, 0.0001, 0.01, 0.001, 0.0001, 0.000001]
    K = np.zeros(49).reshape([7, 7])
    for i in range(0, 7):
        K[i][i] = temp[i]

    # Unit of turnover rate from day^-1 to second^-1
    # 1 day = 86400 seconds
    K = np.multiply(K, 1/86400)

    # Cinput_const, assume to be constant
    input_fluxes = 0.00002245

    nyear = 10000
    times = np.linspace(0, nyear*365*86400, num = nyear)
    iv_list = [0,0,0,0,0,0,0]

    mod = GeneralModel(times, B, A, K, iv_list, input_fluxes)
    res = mod.get_x()

    fig = plt.figure(6*2, figsize=(14, 7.68))
    plt.subplots_adjust(left = 0.1, right = 0.95, bottom = 0.10, top = 0.9, wspace =0.2, hspace =0)
    x = list(range(1,nyear+1, 1))
    pool_names = ["foliage", "root", "wood", "metabolic litter", "structural litter", "fast soil", "passive soil"]
    for i in range(1, 4):
        for j in range(1, 4):
            if ((i-1) * 3 + j) > 7:
                break
            ax = plt.subplot(3, 3, (i-1) * 3 + j)
            ax.plot(x, res[(i-1) * 3 + j - 1,:])
            plt.xlabel("year", fontsize = 12)
            plt.ylabel(pool_names[(i-1) * 3 + (j-1)] + " pool ($g C m^{-2}$)", fontsize = 12)
    plt.savefig(output_folder + "/result" + ".png", dpi = 500)
    #plt.show()
    print(res[:,nyear-1])
    #mod.write_output("./output.csv")

Figure 8.5. Solution to Step 4.
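One way to cross-check a long simulation of the Step 4 model is to compare its final values with the analytical steady state, obtained by setting dX/dt = 0, which gives X_ss = −(AK)⁻¹ B I. The sketch below is an independent check and not part of CarboTrain; it works in units of days and assumes that the input flux of 0.00002245 is expressed per second, as the unit conversion in the source code suggests.

```python
import numpy as np

# Parameters of the Step 4 model, taken from the exercise text.
B = np.array([0.4, 0.35, 0.25, 0, 0, 0, 0], dtype=float)
f41, f51, f42, f52, f43, f53 = 0.7, 0.25, 0.65, 0.25, 0.15, 0.75
f64, f65, f75 = 0.3, 0.05, 0.04
A = np.array([
    [-1,   0,   0,   0,   0,   0,  0],
    [0,   -1,   0,   0,   0,   0,  0],
    [0,    0,  -1,   0,   0,   0,  0],
    [f41, f42, f43, -1,   0,   0,  0],
    [f51, f52, f53,  0,  -1,   0,  0],
    [0,    0,   0,  f64, f65, -1,  0],
    [0,    0,   0,   0,  f75,  0, -1],
], dtype=float)
K = np.diag([0.0017, 0.002, 0.0001, 0.01, 0.001, 0.0001, 0.000001])  # per day

# Assumption: the input flux is given per second; convert it to per day.
input_flux = 0.00002245 * 86400

# Steady state: 0 = B * input + A K X  =>  X_ss = -(A K)^{-1} B * input
X_ss = -np.linalg.solve(A @ K, B * input_flux)

pool_names = ["leaf", "root", "wood", "metabolic litter",
              "structural litter", "fast SOM", "passive SOM"]
for name, value in zip(pool_names, X_ss):
    print(f"{name:18s} {value:12.1f}")
```

Because the passive pool turns over on a time scale of roughly 2,700 years (1/k7 in days), a 10,000-year run should come close to these values for most pools while the passive pool may still be slightly below its steady state.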
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All model intercomparison projects (MIPs) have 
shown large uncertainties in prediction of carbon 
sequestration among models and poor model- 
data matches. Although great efforts have been 
made, it is still difficult to identify causes of model 
uncertainty. This chapter offers a unified diagnos- 
tic system, which is also called a 1-3-5 scheme 
of diagnostics, for uncertainty analysis in carbon 
cycle modeling. The number 1 stands for one for- 
mula to unify the land carbon models, the num- 
ber 3 is for one three-dimensional (3D) space to 
evaluate all model outputs, and the number 5 is for 
five traceable components to pinpoint uncertainty 
sources via traceability analysis. 


UNCERTAINTY IN LAND CARBON CYCLE 
MODELING 


Hundreds of land models have been developed to
predict the future state of ecosystems in an attempt
to inform management practices for climate change
mitigation. However, all model intercomparison
projects (MIPs) have shown large uncertainties in
prediction of carbon sequestration among models
and poor model-data matches (Friedlingstein et al.
2006, 2014; Luo et al. 2015). For example, eleven
earth system models (ESMs) participating in the
Coupled Model Intercomparison Project Phase 5
(CMIP5) perform poorly in predicting the distribution
of global land surface soil carbon (Todd-Brown
et al. 2013). Similarly, the spread among
11 ESMs participating in the subsequent CMIP6
exercise has not changed significantly from CMIP5
results (Arora et al. 2020). The model predictions
are an order of magnitude more uncertain over
land than over ocean. Regionally, terrestrial ecosystem
models did not accurately characterize a
wide range of vegetation functional traits associated
with net primary productivity (NPP) in the
East Asian monsoon area (Cui et al. 2019).


DOI: 10.1201/9780429155659-12

Great efforts have been made in the past decades
to identify causes of model uncertainty. For example,
model development teams add more and more
processes into land carbon models in the hope of
representing ecosystem processes more realistically.
This practice yields mixed results. Incorporation of
the nitrogen cycle into more models has been suggested
to reduce the spread of CMIP6, whereas different
treatments of processes in permafrost regions
result in divergent model predictions. In general,
increasing details in process representation in models
hinders our understanding of holistic system
behavior (Sierra et al. 2018). Benchmark analysis
has been used to evaluate model performance skills,
quantify model-data mismatches, and identify processes
that need to be improved (see Chapter 19).
Data assimilation has been proposed to improve
data-model consistency (see Chapter 37). None of
these approaches, however, fully explains
the causes of model uncertainty. Without identifying
sources of uncertainty, it is extremely difficult
to focus model improvement efforts to realistically
predict future carbon dynamics.


Figure 9.1. The unified diagnostic system with one (1) equation, three dimensions (3D), and five (5) variables.

This chapter introduces a 1-3-5 scheme as a 
unified diagnostic system for uncertainty analy- 
sis (Figure 9.1). This diagnostic system is based 
on the matrix approach to representation of land 
carbon cycle models in one (1) general equa- 
tion without compromising any details of process 
representation. Land carbon cycle dynamics are 
defined by a three-dimensional (3D) space with 
axes of coordinates being carbon input, residence 
time, and carbon storage potential. Model uncer- 
tainty can be traced to five (5) variables, which are 
carbon input, plant partitioning, decomposition, 
carbon transfer, and environmental scalars. Thus, 
the 1-3-5 scheme provides an analytic framework 
to understand the structure of complex models, 
their dynamic behavior, and uncertainty. 


ONE FORMULA TO REPRESENT LAND CARBON 
CYCLE MODELS 


The unified diagnostic system offered by the 
matrix approach is based on one formula to unify 
land carbon cycle models. In the previous two 
training units, we have shown you that the matrix 
equation can unify the land carbon cycle models. 

We have converted 18 models to the matrix
equations. These models are CLM3.5, CLM4,
CLM4.5, CLM5, CABLE, LPJ-GUESS, IBIS, CASA',
CENTURY, ORCHIDEE, TEM, TECO, DELAC2, ELM,
GECO, FBDC, BEPS, and YASSO. In these models,
we reorganize the original carbon balance equations
into one matrix equation without changing
any processes. For example, CLM5 uses hundreds
of carbon balance equations to simulate carbon
transfer among 18 plant pools of 17 vegetation
types and 140 soil pools over 20 soil layers. These
carbon balance equations are organized into one
matrix equation for the 18 vegetation pools and
one matrix equation for the 140 soil pools (Lu
et al. 2020). The vegetation and soil matrix equations
are connected through litterfall and can be
integrated into one matrix equation.

Sierra and Müller (2015) review all soil car- 
bon models according to the principles underly- 
ing models. The matrix equation we have focused 
on in this training course satisfies five principles: 
mass balance, substrate dependence, heteroge- 
neity of decomposition rates, transformation of 
organic matter, and environmental effects. When 
decomposition rates or transfer rates are functions 
of substrate, we still can use a matrix equation to 
describe carbon dynamics. In this case, the matrix 
equation becomes a nonlinear model. Indeed, 
Carlos Sierra's group at the Max Planck Institute
for Biogeochemistry, Germany, uses nonlinear
models. Thus, we can understand models at differ- 
ent levels of complexity under one overarching the- 
ory. Please be mindful that nonlinear matrix models 
will have different properties from the linear ones. 

With one general formula to represent all the
models, we can seek to understand the general
behavior of the land carbon cycle and diagnose
model uncertainty on a common ground. For
example, the widely used model CLM5 includes 18
plant pools for each of the 17 vegetation types and
140 soil pools (i.e., 7 pools per layer over 20 layers)
(Lu et al. 2020). Thus, CLM5 simulates carbon
transfer among 158 pools (i.e., 140 for soil + 18 for
plant) in one grid-cell if it is occupied by one vegetation
type and 194 (i.e., 140 + 3 × 18) pools if
one grid-cell is occupied by three vegetation types.
As a consequence, the matrix equation has at least
158 elements in the pool vector. In contrast, the
CABLE model has nine pools so that the pool vector
has nine elements (Xia et al. 2012). Despite the
different numbers of carbon pools treated by these
two models, they share the fundamental structure
represented in the matrix equation, a structure that
is shared by all land carbon cycle models.


Figure 9.2. Model structure and parameterization of CABLE (a) and CLM3.5 (b). Carbon enters the system through photosynthesis and is partitioned among live plant pools. From live pools, carbon is transferred to litter pools, and then to soil carbon pools. Values in boxes show the pool residence times. Values outside the boxes show the partitioning and transfer coefficients. Abbreviations: CWD, coarse woody debris; Str., structural litter; Met., metabolic litter; Surf., surface litter; micr., microbial biomass; SOM, soil organic matter (from Rafique et al. 2016).

Besides the differences in pool numbers, each of 
the five components of the matrix equation (i.e., 
equation 1.6 in Chapter 1) is represented differently 
either by fixed values, functions, or nested mod- 
els. The differences in parameterization of the five 
components also contribute to different simulation 
results. For example, CABLE allocates 61% of NPP to 
roots, 23% to wood and 16% to leaves (Figure 9.2a) 
whereas CLM3.5 allocates 43% of NPP to leaves, 
16% to wood and 41% to roots (Figure 9.2b) 
(Rafique et al. 2016). Similarly, a large difference 
exists in carbon transfers from live plants to litter 
and soil. CABLE transfers dead tissues into three lit- 
ter pools (including coarse woody debris, CWD) 
after senescence, whereas CLM3.5 transfers dead 
plant tissues to six litter pools (including CWD) 
after mortality. CLM3.5 and CABLE also differ in 
representing decomposition (i.e., K matrix in Figure 
9.1). While the two models can both be represented
by one matrix formula, CLM3.5 realizes each of the
five components differently from CABLE, leading
to different model projections of carbon dynamics.
Despite these differences in model structure and
parameterization, all these models can be represented
by the same matrix equation.


YIQI LUO


A THREE-DIMENSIONAL (3D) SPACE TO
DESCRIBE MODEL OUTPUTS


Now let us explore what the 3D space means for 
evaluating model outputs. Chapter 1 explains that 
we need three variables: carbon input, residence 
time, and carbon storage potential to describe the 
transient dynamics of the land carbon cycle. The 
three variables become three dimensions to form 
a 3D space that can place model outputs, no mat- 
ter how differently models are executed, for evalu- 
ation on a common ground. The mathematical 
foundation for using the 3D space to evaluate the 
transient dynamics of the carbon cycle is presented 
in detail in a paper by Luo et al. (2017). Here is a 
description of each of the three dimensions. 

The first dimension of transient dynamics is
carbon input through primary production.
Primary production, being either gross primary
production (GPP) or net primary production
(NPP), quantifies the amount of carbon that enters
an ecosystem to go through a variety of processes
of the land carbon cycle before being released back
to the atmosphere. Of the three variables, carbon
input via GPP or NPP has been studied most.

The second dimension of transient dynamics is 
carbon residence time. Residence time is approxi- 
mated as a mean value (i.e., mean residence time, 
MRT) or turnover time by dividing pool by flux. 
The approximation is valid when the carbon cycle
is at equilibrium but deviates substantially
from residence times for a multi-compartment
system when the carbon cycle is not at
equilibrium (Lu et al. 2018b). Residence time or transit
time is explained in detail in Chapter 15 and can 
be estimated from data assimilation for individual 
ecosystems. Overall, the residence time quantifies 
the duration of carbon staying in an ecosystem 
before being released back to the atmosphere. 

The two terms, NPP and residence time,
together quantify equilibrium carbon storage
capacity. Figure 9.3 shows simulated carbon storage
capacity by three land models, the Beijing
Climate Center (BCC) model, the Canada (CAN)
model, and the Community Earth System Model
Biogeochemical module (CESM BGC). In response
to climate change, all three models simulate
increases in NPP but decreases in residence time.
The CESM BGC simulates the smallest NPP and
residence time, leading to the lowest carbon storage
capacity. The BCC model simulates the largest
increase in NPP, leading to a large increase in carbon
storage capacity. The Canada model simulates
the highest residence time and thus estimates the
highest carbon storage capacity.


However, when models are used to simu- 
late responses of ecosystems to climate changes, 
the carbon cycle is no longer at equilibrium but 
in dynamic disequilibrium. Therefore, we need 
the third term, the carbon storage potential, to 
describe disequilibrium of the land carbon cycle. 
The CESM BGC model has the smallest carbon
storage potential whereas the CAN and BCC mod- 
els have high storage potential (Figure 9.3). 

We use one more example to explain the three 
dimensions (3D) of land carbon cycle dynamics, 
particularly the third dimension. The third dimen- 
sion, carbon storage potential, is a relatively new 
concept and worth more explanation. 

A free-air CO₂-enrichment (FACE) experiment was conducted in Duke Forest in North Carolina, USA. The FACE experiment started in the mid-1990s and ended in the late 2000s. The CO₂ concentration in the treatment rings was elevated by 200 parts per million above the ambient CO₂ concentration.

To illustrate the concept of 3D space of carbon dynamics in the FACE experiment, let us make a few assumptions. Let us assume that NPP is 1000 g C m⁻² yr⁻¹ and residence time is 40 years at ambient CO₂ concentration. Then, the steady-state carbon pool size in the Duke Forest is 40 kg C m⁻² (1000 g C m⁻² yr⁻¹ × 40 yr) before the FACE treatment. We assume that the elevated CO₂ treatment increases NPP by 40% but has no effect on residence time. The steady-state pool size is then 56 kg C m⁻² (1400 g C m⁻² yr⁻¹ × 40 yr) under the elevated CO₂ treatment. For the sake of simplicity, we also assume no seasonal and diurnal changes in forcing variables and carbon processes.


Figure 9.3. The 3D model output space (NPP on the x-axis, in Gt C/year; total ecosystem carbon residence time on the y-axis, in years; and carbon storage potential in color) for three models from CMIP5. The points represent the global annual values of carbon storage for the three variables. The contours represent the carbon storage capacity. Colors show the values of the carbon storage potential.

Figure 9.4. Carbon storage dynamics as determined by carbon storage capacity and potential. Panel A presents an ideal case in which carbon storage capacity (X_c) is a constant while carbon storage potential (X_p(t)) and carbon storage itself (X(t)) vary with time. In this case, the capacity is assumed to abruptly increase by 40%, mainly due to an instantaneous increase in carbon input as in an elevated CO₂ experiment (Luo & Reynolds, 1999). Consequently, the potential immediately increases and then gradually declines as X(t) increases toward the equilibrium (i.e., the carbon storage capacity at elevated CO₂ treatment). Panel B illustrates time-dependent X(t), its capacity and potential in a non-autonomous system over day of year (DOY). Seasonal change in the capacity is due to change in carbon input, which is low in winter and high in summer. The capacity is a moving target that X(t) chases. The rate of chasing is proportional to the storage potential.


In this example, we have two equilibrium states of carbon storage (i.e., carbon storage capacity), one at ambient [CO₂], equaling 40 kg C m⁻², and the other at the elevated [CO₂], denoted as X_c, equaling 56 kg C m⁻² (Figure 9.4A). At the very beginning of the FACE experiment, we need to determine in which direction and how fast the current carbon storage X(t) would change. Generally speaking, we expect that the carbon storage changes toward the equilibrium state at elevated [CO₂] (X_c) but the rate of the change is fast in the first few years and slow in later years. Let us denote the difference between the equilibrium carbon storage and current storage to be the carbon storage potential X_p(t). The rate of carbon storage change is proportional to the potential X_p(t).

The above example illustrates three key findings 
about carbon storage dynamics. First, carbon storage 
is always moving toward the carbon storage capacity. 
Second, the capacity is the ultimate attractor which 
current carbon storage chases (or changes toward). 
Third, the rate of the carbon storage change is pro- 
portional to the carbon storage potential. 

The Duke FACE example assumes equilibrium 
carbon storage capacity. However, the carbon stor- 
age capacity, which equals NPP times residence 
time, is changing over time in the real world as 
both NPP and residence time vary with time. This 
is the reason why the carbon cycle in the terrestrial 
ecosystem is mathematically considered a non- 
autonomous system. 


We use simulated seasonal change of fine root 
biomass in Harvard Forest (Luo et al. 2017) as an 
example to illustrate the non-autonomous sys- 
tem of carbon cycle dynamics. The carbon storage 
capacity of the fine root pool is a theoretical quan- 
tity, which is NPP times residence time. As NPP is 
low in winter and high in summer, the root car- 
bon storage capacity as indicated by the black line 
is low in winter and high in summer (see Figure 
9.4B). The red line indicates the current carbon 
storage in fine roots, which is higher than the 
black line in the winter and lower than the black 
line in the summer. The red line is always moving toward the black line. That is a mathematical explanation of why fine root biomass declines in fall and winter but increases in spring and summer.

For a non-autonomous system in which both 
NPP and residence time are changing with time, 
the carbon storage capacity still controls the direc- 
tion of carbon storage change. Likewise, the carbon 
storage potential still determines the rate of carbon 
storage change. Therefore, carbon input, residence 
time, and carbon storage potential form a 3D space 
within which all model outputs can be evaluated. 
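
The "moving target" behavior of a non-autonomous system can be illustrated with a short simulation. The sketch below again assumes a single first-order pool, dX/dt = u(t) − X/τ, driven by a sinusoidal seasonal input u(t). The parameter values are hypothetical and are not taken from the Harvard Forest simulation. Capacity is computed as X_c(t) = u(t) × τ and potential as X_p(t) = X_c(t) − X(t).

```python
import numpy as np

TAU = 0.5          # residence time of the pool, yr (hypothetical value)
U_MEAN = 0.2       # mean carbon input, kg C m-2 yr-1 (hypothetical value)
AMPLITUDE = 0.8    # relative seasonal amplitude of the input (hypothetical)
N_YEARS = 20       # simulate long enough to forget the initial condition
DT = 1.0 / 365.0   # daily time step, yr

t = np.arange(365 * N_YEARS) * DT
# Input is lowest at the start of the year (winter) and highest mid-year (summer).
u = U_MEAN * (1.0 - AMPLITUDE * np.cos(2.0 * np.pi * t))
x_c = u * TAU                      # time-varying carbon storage capacity

x = np.empty_like(t)
x[0] = U_MEAN * TAU
for k in range(len(t) - 1):
    # Forward Euler step of dX/dt = u(t) - X/tau = (X_c(t) - X) / tau
    x[k + 1] = x[k] + DT * (u[k] - x[k] / TAU)

x_p = x_c - x                      # carbon storage potential

# Report a few days of the final simulated year.
for d in (1, 91, 183, 274):
    i = len(t) - 365 + (d - 1)
    print(f"DOY {d:3d}: X_c = {x_c[i]:.4f}, X = {x[i]:.4f}, X_p = {x_p[i]:+.4f}")
```

In winter the potential is negative (storage sits above the capacity and declines), and in summer it is positive (storage sits below the capacity and grows), which mirrors the red and black lines described for Figure 9.4B.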

Zhou et al. (2018) have applied the 3D space 
to evaluate model performance from three model 
intercomparison projects (MIPs): CMIP5, TRENDY, 
and MsTMIP. The 3D space, defined by NPP on the x-axis, residence time on the y-axis, and color for carbon storage potential, places outputs from 25 models in the same space so that their performance can be evaluated (see Chapter 18 for details). The
three variables can also be plotted to indicate cur- 
rent carbon storage in relatively smooth lines, the 
carbon storage potential in shaded areas, and the 
carbon storage capacity in the zig-zag lines over 
a time course. This is the first time we are able to 
evaluate all model outputs from different MIPs in a 
simple, 3D space. See Chapter 18 for more applica- 
tions of the 3D space to the model evaluation. 

Enqing Hou has recently led a study showing that the uncertainty among eight matrix models can expand or shrink, depending on differences in residence time, when carbon input into the eight models is the same (Hou et al. in prep.). When all the terms used to calculate residence time, which are A, ξ(t), K, and B in equation 1.13 in Chapter 1, are standardized, the model prediction trajectories converge to a single, identical trajectory. This study indicates that the model uncertainty is all related to carbon input and residence time.


FIVE TRACEABLE COMPONENTS FOR 
TRACEABILITY ANALYSIS 


Now, let us examine the five traceable components 
of carbon cycle dynamics, which have been effec- 
tively used in traceability analysis. 

The matrix equation describes land carbon dynamics via carbon input (e.g., NPP), allocation of carbon input to different plant parts, carbon process rates via litterfall or decomposition of organic matter, carbon transfer among pools, and environmental scalars. These constitute the five components of the matrix equation. Mathematically, the
five components are largely independent from 
each other in models (Xia et al. 2012) although 
they may interact in the real world. Moreover, each 
component can be further traced to its subcom- 
ponents as far as individual carbon cycle processes 
and their parameter values. This mathematical 
property enables us to trace sources of model 
uncertainty to individual processes and parameters 
via traceability analysis (see Chapter 17). 
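
To make the five components concrete, the sketch below builds a hypothetical three-pool model (plant, litter, soil) and computes residence time and steady-state storage from them. It assumes the sign convention dX/dt = B u − A ξ K X, with ones on the diagonal of A and negative transfer fractions off the diagonal. The pool structure and all numerical values are illustrative assumptions, not parameters of any model discussed in this chapter.

```python
import numpy as np

# Hypothetical three-pool model: plant -> litter -> soil.
u = 1.0                                      # carbon input (NPP), kg C m-2 yr-1
B = np.array([1.0, 0.0, 0.0])                # allocation of input to pools
K = np.diag([1 / 2.0, 1 / 1.5, 1 / 50.0])    # baseline process rates, yr-1
A = np.array([                               # transfer coefficients among pools
    [1.0, 0.0, 0.0],
    [-1.0, 1.0, 0.0],                        # all plant turnover enters litter
    [0.0, -0.3, 1.0],                        # 30% of decomposed litter enters soil
])
xi = 0.8                                     # environmental scalar (dimensionless)


def dxdt(x):
    """Right-hand side of the assumed matrix equation dX/dt = B u - A xi K X."""
    return B * u - A @ (xi * K) @ x


# Residence time of carbon in each pool per unit input, and its sum.
tau_pools = np.linalg.solve(A @ (xi * K), B)
tau_total = tau_pools.sum()

# Steady-state storage (carbon storage capacity) = input x residence time.
x_capacity = tau_pools * u

print("residence time by pool (yr):", tau_pools)
print("total residence time (yr):", tau_total)
print("storage capacity (kg C m-2):", x_capacity.sum())
```

Because each of u, B, A, K, and ξ enters the calculation separately, changing one of them and recomputing shows directly how that component alters residence time or storage capacity, which is the idea behind tracing uncertainty to individual components.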

This traceability analysis was first developed by 
Xia et al. (2013) for the equilibrium carbon stor- 
age capacity and expanded by Jiang et al. (2017) 
and Zhou et al. (2018) to the transient dynamics 
of the land carbon cycle. The traceability analy- 
sis can trace the model uncertainty in simulated 
carbon storage hierarchically along the trace- 
able pathways down to vegetation traits, climate 
forcing, and soil attributes. Chapter 17 shows an 


authentic traceability analysis that requires matrix 
models to generate all the traceable components 
so that it enables uncertainty analysis to be ana- 
lytically transparent. Chapter 18 shows post-MIP 
traceability analysis that does not require matrix models but can be applied to any model output. The post-MIP traceability analysis can attribute uncertainty among models to variations in NPP, residence time, and carbon storage potential. The variations in NPP, residence time, and carbon storage potential can be further traced to environmental scalars and model parameters.

For example, the Australian CABLE model predicts lower carbon storage capacity than CLM3.5 due to lower NPP, even though CABLE has a higher residence time than CLM3.5 (Rafique et al. 2016). The higher residence time in CABLE is mainly caused by lower decomposition coefficients, which lead to a higher baseline residence time. The environmental scalars of the two models are similar. The example shows that the traceability framework can help trace sources of model uncertainty down to individual processes or parameter values.

The training in unit 3 shows you how we can 
add diagnostic variables in matrix models so that 
we can use the 1-3-5 scheme for uncertainty anal- 
ysis and traceability analysis. Unit 5 will specifi- 
cally show you how to do traceability analysis. 


SUGGESTED READING 


Luo YQ, Shi Z, Lu X et al. (2017) Transient dynamics of 
terrestrial carbon storage: mathematical foundation 
and its applications. Biogeosciences, 14, 145-161. 


QUIZZES 


1. What is the 1-3-5 scheme of the diagnostics 
system? 
2. Please briefly describe traceability analysis. 


3. What does the term “carbon storage potential” 
mean? 


4. A non-autonomous system is 


a. a system which does not conserve mass 
balance. 


b. a system with its properties changing with 
time. 


c. a system with constant pool sizes. 


d. a system with its pool sizes changing with 
time. 
Sensitivity analysis is an important procedure to
understand how model parameter values, input 
or forcing data, and sometimes model struc- 
ture, influence behaviors of the modeled system. 
However, sensitivity analysis is rarely conducted 
for complex land carbon models due to high 
computational cost. The matrix approach makes 
it computationally feasible to conduct sensitivity 
analyses of land carbon models. 


WHAT IS SENSITIVITY ANALYSIS? 


Sensitivity analysis is the study of how the vari- 
ability in the output of a mathematical model or 
system (numerical or otherwise) can be divided 
and allocated to different sources of variability in 
its inputs, parameters, and structures (Wikipedia). 
Sensitivity analysis allows us to identify the param- 
eter or set of parameters that have the greatest 
influence on the model output, and helps us to 
understand model dynamics, trace uncertainty 
sources and calibrate model parameters. 

As a common practice, we sometimes change 
the value of one model parameter, conduct model 
simulations with and without the parameter 
change, and compare the results. Quantifying the
change in the result per unit change of the param- 
eter can give us a sense of how sensitive the model 
is to that parameter. By repeating the analysis for 
other parameters of the model, we may discover 
which parameters are most important in deter- 
mining the predictions of the model. This is the 
one-at-a-time approach. 
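
The one-at-a-time procedure can be sketched in a few lines. The example below uses a hypothetical toy model (steady-state storage of a single pool) rather than a real land carbon model. The function names, parameter names, and values are assumptions chosen only to illustrate the steps: run the model with defaults, perturb one parameter, and divide the change in output by the change in the parameter.

```python
def steady_state_carbon(params):
    """Hypothetical toy model: one-pool steady-state carbon storage.

    storage = input * baseline turnover time / environmental scalar
    """
    return params["u"] * params["tau"] / params["xi"]


def one_at_a_time(model, base_params, rel_change=0.1):
    """Change each parameter by rel_change, one at a time, and return the
    change in model output per unit change in that parameter."""
    y_base = model(base_params)
    sensitivities = {}
    for name, value in base_params.items():
        perturbed = dict(base_params)        # all other parameters stay at default
        perturbed[name] = value * (1.0 + rel_change)
        delta_p = perturbed[name] - value
        sensitivities[name] = (model(perturbed) - y_base) / delta_p
    return sensitivities


base = {"u": 1.0, "tau": 40.0, "xi": 0.8}    # illustrative values only
result = one_at_a_time(steady_state_carbon, base)
for name, s in sorted(result.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>3}: change in output per unit change in parameter = {s:8.3f}")
```

Because the result is expressed per unit change of each parameter, it depends on the units of the parameters; comparing relative (percentage) changes instead is a common alternative when parameters differ greatly in magnitude.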

Let's say we have several data points of a vari- 
able X and corresponding values of a second 
variable Y. Suppose we believe that Y and X are 
related in such a way that a change in X induces 
a response in Y. A preliminary approach we could 
take is to make a scatter plot or a regression anal- 
ysis of Y onto X. If there is a steep relationship 
between Y and X, that means Y is very sensitive to 
X: a small change in X induces a large change in Y. 
These examples are sufficient for linear relation- 
ships or simple cases in which the sensitivity of Y 
to change in X does not vary due to interactions 
with other parameters. In dealing with nonlinear 
responses or nonadditive cases, where interac- 
tions among several parameters affect the sensitiv- 
ity of a model's output to its input, variance-based 
sensitivity methods offer an alternative that can be 
more informative. The variance-based approach
decomposes and attributes the uncertainty of 
model output to inputs. For example, for a model 
with two inputs and one output, if 30% of the 
variance of the output is attributable to param- 
eter P1, 45% is caused by parameter P2 and 25% 
by the interaction between P1 and P2, we could
interpret these percentages as measures of the 
model sensitivity (Wikipedia). 


SOBOL SENSITIVITY ANALYSIS 


In this chapter, we focus mainly on one variance- 
based method, the Sobol method, as an approach 
to partitioning the sensitivity of the output of a 
model to its parameters, inputs or structures. Here 
we focus on parameters. Sobol sensitivity analysis 
is a global sensitivity method. Here, “global” refers 
to the whole parameter space, in comparison to 
some methods that can only give us information 
around a limited region of the parameter space, 
i.e., the local sensitivity. The Sobol method relies 
on Monte Carlo sampling and can separate the sin- 
gle and interaction effects among parameters. Land 
carbon models typically consist of many equations, 
parameters, and state variables interacting with 
each other. Nonlinear responses are common in 
these models. A method that measures the sensitiv- 
ity across the whole parameter space and quantifies 
the interactions is very attractive. The shortcom- 
ing of this method is its relatively high compu- 
tational cost. Process-based land carbon models 
often track diverse processes with response scales 
ranging from minutes (e.g., half-hourly time step
for photosynthetic carbon uptake) to centuries or 
millennia (e.g., turnover of recalcitrant soil organic 
matter pools). Global or regional simulations 
using these models are normally computationally expensive and rely on the use of supercomputers. The large number of simulations required by the Sobol method and the high computational cost of complex land carbon models, especially over large spatial-temporal scales, make direct applications of the Sobol method to complex land carbon models very challenging. The matrix version of the model, thanks to its efficient semi-analytical spin-up, makes such a task feasible.

Here I take the ORCHIDEE-MICT model as 
an example. As introduced in Chapters 3 and 
5, the litter and soil carbon component of the 
model simulates carbon dynamics across 100 
pools to capture multiple real-world processes 


that govern terrestrial carbon cycling. Vegetation 
carbon enters litter pools during processes such 
as leaf senescence, root and wood turnover, and 
fire disturbance. ORCHIDEE-MICT tracks four 
classes of litter, namely aboveground metabolic 
litter, belowground metabolic litter, aboveg- 
round structural litter, and belowground struc- 
tural litter. Aboveground and belowground litter 
differ in the environmental conditions (tempera- 
ture and moisture) that modify the rate of litter 
decomposition. Litter carbon (both aboveground 
and belowground) is transferred into vertically 
resolved soil carbon pools as litter decomposes. 
ORCHIDEE-MICT divides soil carbon into 32 lay- 
ers, down to a depth of 38 m. The thickness of soil 
layer increases from shallow to deep soil layers. In 
each soil layer, ORCHIDEE-MICT tracks three dif- 
ferent types of soil carbon pools, that is, the active 
soil organic carbon (SOC) with a default poten- 
tial turnover time of 0.145 year and the slow and 
passive SOC pools with potential turnover times 
of 5.48 and 241 years, respectively. Once it enters 
the soil, carbon is transferred among a complex 
network of soil carbon pools, distinguished by 
vertical layer and turnover time. Ultimately, soil 
carbon can enter the atmosphere through respira- 
tion, be transformed into more recalcitrant pools 
with longer turnover times, or be buried into deep 
soil layers through cryoturbation or bioturbation. 
Advection is not considered in this model version. 
The original model is rewritten into the matrix 
form introduced in Chapter 5. 

Each element in these matrices might be linked 
to several processes or functions as mathematical
expressions of mechanisms controlling land car- 
bon dynamics. To evaluate which sets of param- 
eters are most important in determining carbon 
dynamics in the modeled system, we perform sen- 
sitivity analysis on the matrix model. 

Based on our understanding of carbon cycling 
and the original model, we choose 34 parameters 
that could potentially be important (see Table 10.1). 
These parameters regulate elements from different 
matrices. For example, the lignin content affects the 
transfer of carbon flux from one pool to another. 
It also modifies the turnover rate. We take advan- 
tage of the matrix version of the model for semi- 
analytical spin-up (see Chapter 14), which saves us 
a lot of computational time. We randomly sample 
parameters across the whole parameter space and 
conduct 2,720,000 (= 34 × 100 × 100 × 8) model simulations (see more details below). As you can see, this is a large number (nearly three million) of simulations. It is not realistic to conduct so many simulations with the original model due to its high computational requirements, but with the matrix model it is feasible. Based on the simulation results, we calculate the single as well as the total effect of the targeted parameters on the output of the model, using Sobol's index.

TABLE 10.1

Parameters affecting litter and SOC dynamics in the ORCHIDEE-MICT model

No. | Name | Matrix source | Description | Default value | Range | Unit
1 | ins | I | Input scalar | 1 | [0,1] |
2 | p4lf | I | Partition (structural vs. metabolic) coefficient for leaf | 0.6916 | [0,1] |
3 | p4sa | I | Aboveground sapwood partition, structural vs. metabolic | 0.598 | [0,1] |
4 | p4sb | I | Belowground sapwood partition, structural vs. metabolic | 0.598 | [0,1] |
5 | p4ha | I | Aboveground heartwood partition, structural vs. metabolic | 0.598 | [0,1] |
6 | p4hb | I | Belowground heartwood partition, structural vs. metabolic | 0.598 | [0,1] |
7 | p4ro | I | Root partition, structural vs. metabolic | 0.6916 | [0,1] |
8 | p4fr | I | Fruit partition, structural vs. metabolic | 0.6916 | [0,1] |
9 | p4ca | I | Carbohydrate reserve partition, structural vs. metabolic | 0.6916 | [0,1] |
10 | fam2a | A | Transfer fraction, aboveground metabolic litter to active SOC | 0.45 | [0,1] |
11 | fbm2a | A | Transfer fraction, belowground metabolic litter to active SOC | 0.55 | [0,1] |
12 | fas2a | A | Transfer fraction, aboveground structural litter to active SOC | 0.45 | [0,1] |
13 | fbs2a | A | Transfer fraction, belowground structural litter to active SOC | 0.45 | [0,1] |
14 | fas2s | A | Transfer fraction, aboveground structural litter to slow SOC | 0.7 | [0,1] |
15 | fbs2s | A | Transfer fraction, belowground structural litter to slow SOC | 0.7 | [0,1] |
16 | fa2p | A | Transfer fraction, active to passive SOC | 0.004 | [0,0.15] |
17 | fs2a | A | Transfer fraction, slow to active SOC | 0.42 | [0,0.5] |
18 | fs2p | A | Transfer fraction, slow to passive SOC | 0.03 | [0,0.5] |
19 | fp2a | A | Transfer fraction, passive to active SOC | 0.45 | [0,1] |
20 | zlit | A | Factor controlling vertical distribution of litter input to SOC | PFT dependent | [0.2,1.25] |
21 | clay | A, ξ | Clay content | 0.2 | [0,0.6] |
22 | ke | A, ξ | Lignin coefficient on structural litter decomposition | 3 | [0,10] |
23 | lga | A, ξ | Aboveground lignin content | 0.76 | [0,1] |
24 | lgb | ξ_L | Belowground lignin content | 0.72 | [0,1] |
25 | temps | ξ_T | Temperature sensitivity | 0.69 | [0,1] |
26 | ms | ξ | Moisture scalar | 1 | [0.8,1.2] |
27 | tau4ml | K | Turnover time, metabolic litter | 0.066 | [0,0.066] | year
28 | tau4sl | K | Turnover time, structural litter | 0.245 | [0,0.245] | year
29 | tau4a | K | Turnover time, active SOC | 0.149 | [0,0.149] | year
30 | tau4s | K | Turnover time, slow SOC | 5.48 | [0,5.48] | year
31 | tau4p | K | Turnover time, passive SOC | 241 | [0,241] | year
32 | cryo | V | Cryoturbation rate | 0.001 | [0,1] | m²/year
33 | bio | V | Bioturbation rate | 0.0001 | [0,1] | m²/year
34 | alt | V | Maximum active layer depth of the last year | 0.2 | [0,3] | m

To be specific, we randomly sample parameters 
within ranges provided in Table 10.1 assuming a 
uniform distribution for each parameter. In choos- 
ing a uniform distribution, we assume each value 
within the range is equally possible for the param- 
eter. Parameter ranges are chosen based on model 
information, parameter meaning (e.g., the transfer 
fraction cannot be bigger than 1), and empirical 
knowledge. With these randomly chosen param- 
eters, we conduct 34 x 100 x 100 simulations. 34 
is the number of parameters. 100 is the number of 
random values sampled from the potential distribu- 
tion of parameter i. For each random value of param- 
eter i, we randomly sample 100 sets of parameters 
excluding parameter i. The first-order Sobol sensitivity index $S_i$ for parameter i ($p_i$) is given by:

$$S_i = \frac{V_{p_i}\left(E_{p_{\sim i}}(Y \mid p_i)\right)}{V(Y)} \qquad (10.1)$$

where $V(Y)$ denotes the total variance; $Y \mid p_i$ indicates simulation outputs given $p_i$; $p_{\sim i}$ represents the parameter space excluding the ith parameter; and $E$ corresponds to the expectation from multiple simulations. Similarly, the total-order Sobol index is given by:

$$S_{Ti} = \frac{E_{p_{\sim i}}\left(V_{p_i}(Y \mid p_{\sim i})\right)}{V(Y)} \qquad (10.2)$$


Due to the random sampling, the Sobol index may depend on the specific samples, especially when the sample size is not large enough, the so-called convergence issue. To reduce the convergence issue, we repeat the previous step eight times (34 × 100 × 100 × 8 simulations in total) and calculate our final Sobol index as the average of the eight replicates.
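
The sampling scheme and the two Sobol indices can be sketched on a small toy problem. The code below is an illustrative assumption, not the implementation used in the study. It replaces ORCHIDEE-MICT with a hypothetical three-parameter function, and it reuses the same inner parameter sets for every value of parameter i so that both the first-order and the total-order indices can be estimated from one n_outer × n_inner grid. All function and argument names are hypothetical.

```python
import numpy as np


def toy_model(p):
    """Hypothetical stand-in for a model output; p has shape (..., 3)."""
    return p[..., 0] * p[..., 1] + p[..., 2]


def sobol_indices(model, lower, upper, n_outer=100, n_inner=100, n_rep=8, seed=0):
    """Estimate first- and total-order Sobol indices by nested Monte Carlo.

    Uses n_par * n_outer * n_inner * n_rep model evaluations in total and
    averages the indices over the n_rep replicates.
    """
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_par = lower.size
    first = np.zeros((n_rep, n_par))
    total = np.zeros((n_rep, n_par))
    for r in range(n_rep):
        for i in range(n_par):
            # n_outer random values of parameter i (uniform distribution).
            p_i = rng.uniform(lower[i], upper[i], size=n_outer)
            # n_inner random sets of parameters; column i is overwritten below.
            others = rng.uniform(lower, upper, size=(n_inner, n_par))
            grid = np.broadcast_to(others, (n_outer, n_inner, n_par)).copy()
            grid[:, :, i] = p_i[:, None]
            y = model(grid)                    # shape (n_outer, n_inner)
            v_y = y.var(ddof=1)                # total variance V(Y)
            # Variance over p_i of the mean over the other parameters.
            first[r, i] = y.mean(axis=1).var(ddof=1) / v_y
            # Mean over the other parameters of the variance over p_i.
            total[r, i] = y.var(axis=0, ddof=1).mean() / v_y
    return first.mean(axis=0), total.mean(axis=0)


first, total = sobol_indices(toy_model, lower=[0, 0, 0], upper=[1, 1, 1])
print("first-order:", np.round(first, 3))
print("total-order:", np.round(total, 3))
```

For this toy function, the two multiplied parameters have a total-order index larger than their first-order index because part of their effect arises through their interaction, whereas the additive third parameter has equal first- and total-order indices.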

The results of the sensitivity analysis are shown in Figure 10.1. The left two panels show the parameters ranked by sensitivity from high to low based on the effect of the parameter on total soil organic carbon. The first panel is the total effect (single + interaction), quantified by the total-order Sobol index, $S_{Ti}$ (Equation 10.2). The second panel shows the single effect, quantified by the first-order Sobol index, $S_i$ (Equation 10.1). As you can
see, the model is most sensitive to the parameter 
that controls the external carbon input into this 
system (ins), followed by the turnover time of the 
passive soil organic carbon pool. The right three 
panels show the sensitivity of active SOC, slow SOC 
and passive SOC to different parameters at differ- 
ent soil depths. Different carbon pools are sensitive 
to different parameters at different soil depths. The 
point here is, with the matrix version of the carbon 
model, we are able to look into different aspects of 
the model sensitivity in a great deal of detail. 


ONE-AT-A-TIME SENSITIVITY ANALYSIS 


To investigate more closely the sensitivity of the 
model to key parameters, revealed by the Sobol 
analysis, we can perform one-at-a-time sensitivity 
analysis. For example, Figure 10.1 tells us that the 
parameter that regulates external carbon input (ins) 
and that scaling the turnover of passive SOC (tau4p) 
are important in terms of the sensitivity of total 
SOC, while the active layer depth (alt) seems to 
be important in determining the relative amount 
of carbon stored in active, slow and passive SOC 
pools. In Figure 10.2, we change these parameters 
in even steps denoted by different colors and look 
into changes of active SOC (left panels), slow SOC (middle panels) and passive SOC (right panels). For the carbon input parameter, the sensitivity is relatively linear in surface soil layers, as indicated by the even spacing between different colored lines, while the sensitivity to the active layer depth is generally nonlinear.

Figure 10.1. Sensitivities of soil organic carbon to 34 relevant parameters, estimated using the Sobol method. The left two panels are for total soil organic carbon. The right three panels are for different soil organic carbon pools in different soil layers. Colors indicate different soil layers. Adapted from Huang et al. (2018a).

Figure 10.2. Sensitivities of different SOC pools to litter input scalar (top three panels, (a)-(c), ins), passive SOC turnover time scalar (middle three panels, (d)-(f), tau4p) and the maximum active layer depth of the last year (bottom three panels, (g)-(i), alt) through changing each parameter one-at-a-time. The x-axis corresponds to soil carbon content and the y-axis corresponds to soil vertical layers (larger numbers mean deeper soil layers). Colors denote different parameter values. Adapted from Huang et al. (2018a).

Figure 10.3. Sensitivity of high latitude (> 50°N) total SOC stocks to 20% change in carbon input (ins), maximum active layer depth (alt) and turnover time of passive SOC (tau4p). Sensitivity is quantified as the ratio between the change in total SOC and the change in the corresponding parameter. Panel (a) shows total SOC stock (kg C/m²). Panels (b), (c), (d), (e), (f) show the sensitivity of total SOC to the indicated parameter. Adapted from Huang et al. (2018a).


SPATIAL PATTERN 


The spatial or geographic pattern of a model's 
sensitivity can often be insightful. This is because 
the importance of different parameters can vary 
depending on the environmental drivers, which 
vary along geographic gradients, and in different 
biomes. In Figure 10.3, the top left panel shows 
total carbon stock simulated by the matrix model 
over the northern high latitudes. The top right 
panel is the sensitivity to the external input param- 
eter (ins), the middle-left panel is the sensitivity to 
active layer depth (alt), the middle right panel is the 
sensitivity to the turnover time of passive SOC pool 
(tau4p), the bottom left panel is the sensitivity to 
temperature (temp), and the bottom right shows the 
sensitivity to the fraction of slow soil carbon transferred to passive carbon pools (fs2p). The sensitivity to ins is spatially homogeneous as the response of the model to this parameter is linear, while sensitivities to alt and tau4p vary with location.

In summary, the matrix version of the carbon 
model brings flexibility in comprehensive sen- 
sitivity analyses. Hopefully the case study of this 
chapter will provide you with inspiration to make 
use of the matrix version of carbon models for 
your own studies. 


SUGGESTED READING 


Huang, Y.Y., Zhu D., et al. 2018. Matrix-Based Sensitivity 
Assessment of Soil Organic Carbon Storage: A Case 
Study from the ORCHIDEE-MICT Model. Journal 
of Advances in Modeling Earth Systems 10, 1790-1808, 
doi:10.1029/2017ms001237. 


QUIZZES 


1. What is a sensitivity analysis? 


2. Why does it take nearly 3 million simulations to 
understand parameter sensitivities in this study? 

3. Why is sensitivity analysis computationally 
costly? 


4. What are the key benefits of the matrix model 
form for sensitivity analysis? 
Soil phosphorus supply regulates terrestrial car-
bon dynamics. To improve our understanding of 
this regulation, we need to improve our under- 
standing of soil phosphorus dynamics first. This 
chapter first briefly reviews research progress in 
modeling soil phosphorus dynamics, with a focus 
on the diversity of representations among models. 
This chapter then demonstrates how to construct 
a soil phosphorus model and how to transfer it 
to a matrix form. Finally, this chapter presents an 
example study to show how assimilating data into 
the matrix model can improve our understand- 
ing of soil phosphorus dynamics and availability. 
Overall, this chapter demonstrates that the matrix 
approach and data assimilation are very useful 
techniques to study terrestrial nutrient dynamics. 


INTRODUCTION 


Phosphorus (P) is a key element of macromolecules 
such as deoxyribonucleic acid (DNA), ribonucleic 
acid (RNA), and adenosine triphosphate (ATP), of
which the former two are carriers of genetic infor- 
mation and the last is the carrier of energy during 
biochemical reactions. Given the important roles of P 
in life, plants and other organisms need P for growth 
and reproduction. In natural ecosystems, plants 
mainly obtain P through uptake from the soil. Since 
soil P supply is not always sufficient to meet plant P 
demand, P limits plant production and its response to 
elevated atmospheric carbon dioxide (CO₂) in many
terrestrial ecosystems. Moreover, P can regulate eco- 
system carbon storage and cycling by affecting soil 
microbial activity. Therefore, improving our under- 
standing of terrestrial P dynamics and P-C interac- 
tions is crucial to realistically simulate the land C 
cycle as well as its responses to future global change. 

The application of data assimilation to quantify 
terrestrial C dynamics is well established (e.g., Luo 
et al. 2016). However, data assimilation to inform 
a soil P model has rarely been tried (Hou et al. 
2019), despite suitable soil P measurements being 
available in the literature (Hou et al. 2018). The
data assimilation approach is potentially comple- 
mentary to other currently available techniques 
(e.g., isotope dilution technique) in quantifying 
soil P dynamics. For example, it can use multiple 
sources of existing observations (e.g., soil P pool 
size and plant P uptake), simultaneously quantify 
the rates of all major soil P processes, and pro- 
vide information about the uncertainties of the 
parameters related to soil P. Moreover, it may be
particularly useful to quantify the dynamics of 
slow-cycling soil P pools (e.g., soil occluded P), 
which can hardly be achieved by any currently 
available experimental technique. 

This chapter will first briefly review research 
progress in modeling soil P dynamics, with a focus 
on the diversity of representations among models. 
The chapter will then propose how a matrix frame- 
work and data assimilation can potentially improve 
our modeling of soil P dynamics. After that, an exam- 
ple study is presented to show how assimilating data 
into a matrix model can help to improve our under- 
standing of soil P dynamics and availability. 


A BRIEF OVERVIEW OF SOIL P DYNAMICS 
MODELS 


Soil P dynamics is a key component of terrestrial P 
dynamics. To account for soil P supply as a deter- 
mining factor for plant growth, land models are 
increasingly being extended to incorporate soil 
P dynamics. The first well-known land model to 
include soil P dynamics was CENTURY, published 
in the 1980s (Parton et al. 1988). In the last decade, 
because of growing interest in P cycle regulation 
of the land C cycle, over ten land models have been 
extended by incorporating soil P dynamics. 
Current soil P models share some common 
features. Most are constructed based on the well- 
known soil P pools including soil labile P (readily 
available to plants), organic P (in organic forms), 
secondary mineral P (associated with second- 
ary minerals such as iron and aluminum oxides), 
primary mineral P (apatite P), and occluded P 
(associated with soil clay and minerals and not 
directly available to plants). The dynamics of P as 
it accumulates and flows among these pools are 
usually expressed by a set of equations, based 
on empirical understanding of soil P dynamics. 
Despite a generally good understanding of soil P 
processes, most soil P models are not well parame- 
terized, calibrated, or validated, because of limited 


long-term observations of soil P dynamics. In cur- 
rent modeling practice, soil P models are typically 
not fully spun up to a steady state before formal 
simulations, even though spin-up to steady state is 
normally regarded as essential for land modeling. 
High computational cost associated with the ini- 
tialization of slow-cycling soil P pools such as soil 
occluded P, as well as the unidirectional change in 
soil primary mineral P with time, make a full spin- 
up impractical for most current models. 

Current P models differ both in structure and 
parameters. Regarding structure, some models 
include both water soluble P and labile P pools. 
The former can be directly available to plants and 
the latter may not be directly available to plants 
but exchangeable with the water soluble P pool. 
However, other models do not include a soil water 
soluble P pool and assume that soil labile P is directly 
available to plants. Both empirical studies and theory 
have suggested that the soil occluded P pool may 
be depleted under some conditions (e.g., an anaerobic 
environment) and accumulate under others 
(e.g., an aerobic environment). However, most land 
models that incorporate soil P dynamics assume that 
soil occluded P accumulates all the time. Moreover, 
model structure with respect to soil organic P differs 
widely among models, because soil organic P 
dynamics are coupled with soil organic C dynam- 
ics via soil C:P ratios in models, while models have 
very different structures for soil organic C dynamics 
(e.g., different numbers of soil organic C pools and 
vertically resolved vs. unresolved schemes). 

Parameter values scaling soil P dynamics also 
differ among models. The literature provides many 
observations of soil P pools but few observations of 
soil P fluxes, especially fluxes from slow-cycling soil 
P pools (e.g., secondary mineral P and occluded P). 
Given these limitations, most soil P dynamics mod- 
els are poorly parameterized and thus suffer from 
large uncertainties when they are used to predict 
future changes. Moreover, regulation by environ- 
mental factors (e.g., soil temperature and moisture 
content) of soil P dynamics and enzyme-mediated 
soil P mineralization are not well represented in 
current soil P models. In summary, the modeling of 
soil P dynamics is still in its infancy. 


MATRIX APPROACH TO SOIL P MODELING AND 
DATA ASSIMILATION 


The matrix approach and data assimilation can 
potentially help soil P modeling in several ways. 
First, a matrix representation of the soil P system 
makes it easier to depict, understand, and compare 
models than the traditional approach using many 
difference equations. In the standard approach, 
change per time step in each soil P pool is repre- 
sented by one equation with multiple state vari- 
ables and parameters. Different soil P pools are 
mathematically linked by cross-terms in multiple 
equations, but such linkages may not be appar- 
ent to experimentalists who may have the data to 
validate the model but may not have any model- 
ing experience. A matrix representation of soil 
P dynamics transfers the equations for different 
soil P pools into a unified form, no matter how 
many equations or what type of model, as long as 
a model is structured to describe transfers among 
multiple soil P pools. The matrix representation of 
soil P dynamics facilitates our understanding of 
model structure (e.g., the number of pools and the 
linkages among pools) and enables a direct com- 
parison of structure among models. 

A matrix representation of soil P dynamics also 
has the advantage of enabling a fast model spin- 
up and data assimilation that may be difficult 
or impossible with a traditional P model due to 
the prohibitive amount of computational power 
required. Soil P pools besides primary mineral P 
are generally in equilibrium in natural terrestrial 
ecosystems. Therefore, equilibrium soil P pool 
sizes are usually needed before a model simula- 
tion or data assimilation procedure. The turnover 
of soil occluded P in the field is typically slow 
(hundreds to tens of thousands of years), poten- 
tially even slower than the turnover of soil passive 
organic C (hundreds to thousands of years) which 
constitutes a bottleneck in spin-up of carbon-only 
ecosystem models. Therefore, it is a computational 
challenge to spin up a soil P model in the tradi- 
tional way (e.g., repeat forcing many times). If 
soil P dynamics are represented in a unified matrix 
form, the semi-analytical spin-up (SASU) method 
introduced in Chapter 14 can be used to perform 
model spin-up, with the pool size of soil primary 
mineral P set to be constant. Application of SASU 
to soil P modeling may accelerate model spin-up 
by one or more orders of magnitude and may 
make it computationally feasible to assimilate soil 
P measurements from the field into models. 
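
The speed-up that a matrix form makes possible can be illustrated with a toy calculation. The sketch below is not the SASU procedure of Chapter 14; it only contrasts a brute-force spin-up with a direct solution of the steady-state equation for a hypothetical two-pool system. The pool structure, rates, and function names are assumptions chosen for illustration.

```python
# Hypothetical two-pool system: a fast pool fed by a constant input and a slow pool.
k_fast, k_slow = 0.05, 1e-5   # turnover rates (d^-1), illustrative values only
f_fs, f_sf = 0.3, 1.0         # transfer fractions fast->slow and slow->fast
u = 1.0                       # constant input to the fast pool (mass d^-1)

# dX/dt = b + M X, where M = A K with A = [[-1, f_sf], [f_fs, -1]]
M = [[-k_fast, f_sf * k_slow],
     [f_fs * k_fast, -k_slow]]
b = [u, 0.0]


def steady_state(M, b):
    """Solve 0 = b + M X for a 2 x 2 system using Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    x_fast = -(M[1][1] * b[0] - M[0][1] * b[1]) / det
    x_slow = -(M[0][0] * b[1] - M[1][0] * b[0]) / det
    return [x_fast, x_slow]


def brute_force_spinup(M, b, years, dt=1.0):
    """Step the same system forward from empty pools with forward Euler."""
    x = [0.0, 0.0]
    for _ in range(int(years * 365 / dt)):
        dx0 = b[0] + M[0][0] * x[0] + M[0][1] * x[1]
        dx1 = b[1] + M[1][0] * x[0] + M[1][1] * x[1]
        x = [x[0] + dx0 * dt, x[1] + dx1 * dt]
    return x


x_star = steady_state(M, b)                    # one linear solve
x_1000 = brute_force_spinup(M, b, years=1000)  # 365,000 daily steps
print("analytical steady state:", [round(v, 2) for v in x_star])
print("after 1000 simulated years:", [round(v, 2) for v in x_1000])
print("fraction of slow-pool steady state reached:", round(x_1000[1] / x_star[1], 3))
```

In this toy case the slow pool is still short of its steady state after a thousand simulated years, whereas the linear solve returns the steady state in a single step.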

Data assimilation enables a quantitative and 
predictive understanding of soil P dynamics and 
availability. The understanding is fundamental 
to an accurate management of soil P availability, 


which can in turn help meet societal needs such as 
efficient use of P fertilizer in croplands, preventing 
eutrophication of water bodies, and developing 
strategies to alleviate P constraints on terrestrial 
C sequestration. To this end, an example is given 
below to show how matrix model and data assimi- 
lation approaches can be used in soil P studies to 
gain insights into soil P dynamics and availability. 


AN EXAMPLE OF APPLYING A MATRIX MODEL 
AND DATA ASSIMILATION TO SOIL P 


Data Selection and Description 


The example assimilates datasets reported in 
Guo et al. (2000) into a soil P dynamics model. 
Guo et al. (2000) reported consistent (i.e., using 
the same fractionation procedure) and repeated 
(seven or eight times) measurements of P frac- 
tions of eight soils that represent four of the twelve 
major USDA soil types. The datasets were selected 
because most soil P fractions changed substantially 
during the study period, which potentially 
offers a constraint on modeled soil P dynamics. 
Additionally, the soil samples are representative of 
common soil types, which broadens the scope of 
applicability of the model. A less desirable feature of 
these datasets is that they are derived from experi- 
ments performed in the relatively artificial condi- 
tions of a greenhouse. One likely consequence is 
that the P pools in these experiments might have 
higher turnover rates than in the field. 

In a nutshell, Guo et al. (2000) used crops to 
remove labile P from the eight soils to trigger 
changes in P fractions in these soils over a total of 
14 cropping periods. After every two croppings, 
they sampled small amounts of the soils to deter- 
mine soil P fractions using a modified Hedley P 
fractionation procedure. They fractionated soil P 
into several soil P fractions, corresponding to P 
pools with different turnover times and plant avail- 
ability. Further details of the experimental design, 
cropping, sampling and preparation of the soils, 
and determination of the P fractions and physi- 
cochemical properties of the soils, are available in 
Guo et al. (2000). 


Construction of the P Matrix Model 


Our example follows the study of Hou et al. (2019). 
Here we construct a soil P dynamics model that 
aligns with the datasets acquired by Guo et al. (2000). 


Figure 11.1. Schematic representation of the soil P model. 
Primary P indicates primary mineral P; secondary P indicates 
secondary mineral P; Pi indicates inorganic P; Po indicates 
organic P. An ‘a’ on an arrow indicates the coefficient of soil P 
transformation: plant immobilization ($a_{p1}$), weathering ($a_{14}$), 
microbial immobilization ($a_{21}$), mineralization ($a_{12}$), 
sorption/precipitation ($a_{31}$), desorption/dissolution ($a_{13}$), and 
solid-phase transformations ($a_{25}$, $a_{52}$, $a_{36}$, and $a_{63}$). 
Derived from Hou et al. (2019). 


Our model groups the measured soil P 
fractions into six ecologically meaningful soil P 
pools, whose dynamics we track in the model. 
These six pools are labile P ($P_1$ in Figure 11.1), 
non-occluded Po ($P_2$ in Figure 11.1), secondary 
mineral P ($P_3$ in Figure 11.1), primary mineral 
P ($P_4$ in Figure 11.1), occluded Po ($P_5$ in Figure 
11.1), and occluded Pi ($P_6$ in Figure 11.1). The 
labile P is inorganic P that is readily available to 
plants. The non-occluded Po is organic P that is 
sorbed by soil particles or secondary minerals 
(e.g., aluminum and iron oxides) and that can 
be mineralized by enzymes. The secondary min- 
eral P is inorganic P that is sorbed by secondary 
minerals, which is not readily available to plants 
but is exchangeable with the labile P. The primary 
mineral P is P that is associated with primary 
minerals and exists mainly as apatite P. The 
occluded Po is organic P that is stabilized by soil 
minerals and that is presumably not mineralizable 
by enzymes unless dissolved. The occluded Pi is 
inorganic P that is occluded by soil minerals or 
aggregates and turns over very slowly. In Guo et 
al. (2000), the occluded Po and occluded Pi were 
not separated but determined together as residual 
P (P not extracted by the chemical reagents used). 
A proportion of residual P in organic forms (OP,) 
is introduced here to represent the amounts of 
occluded Po (calculated as residual P x OP,) and 
occluded Pi (calculated as residual P x (1—OP,)). 


We consider all major soil P transformations in 
our model, as detailed in Figure 11.1. We calculate 
plant P uptake during a specific period as the 
sum of the decreases in $P_1$ to $P_6$ during the period. 
For instance, we calculate plant P uptake after 
cropping 14 as the difference between the sum of 
$P_1$ to $P_6$ at cropping 0 and the same sum after 
cropping 14. We don’t consider P leaching in our model, 
because soil moisture content was maintained near 
soil available water capacity during the experi- 
ment. We don’t consider atmospheric P deposition 
in our model either because the experiment was 
performed in a greenhouse. Moreover, we don’t 
consider litterfall in our model, because the plants 
were young (< 45 days of growth) and unlikely to 
produce any litterfall. Matlab code and the eight 
datasets of soil P fractions are freely accessible via 
Hou et al. (2019). 
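
The uptake calculation described in this paragraph can be written compactly. As a sketch, let $U$ (a symbol introduced here for illustration only) denote plant P uptake between two sampling times $t_a$ and $t_b$. Assuming, as the text does, that leaching, atmospheric deposition, and litterfall are negligible, the loss of total soil P equals plant uptake:

```latex
U(t_a, t_b) = \sum_{i=1}^{6} \left[ P_i(t_a) - P_i(t_b) \right]
```

For uptake after cropping 14, $t_a$ is cropping 0 and $t_b$ is cropping 14.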

Reflecting the structure depicted in Figure 
11.1, we represent soil P dynamics in the model 
by the following set of balance equations. 


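$$
\begin{aligned}
\frac{dP_1(t)}{dt} &= a_{14} k_4 P_4(t) + a_{12} k_2 P_2(t) + a_{13} k_3 P_3(t) - k_1 P_1(t) \\
\frac{dP_2(t)}{dt} &= a_{21} k_1 P_1(t) + a_{25} k_5 P_5(t) - k_2 P_2(t) \\
\frac{dP_3(t)}{dt} &= a_{31} k_1 P_1(t) + a_{36} k_6 P_6(t) - k_3 P_3(t) \\
\frac{dP_4(t)}{dt} &= -k_4 P_4(t) \\
\frac{dP_5(t)}{dt} &= a_{52} k_2 P_2(t) - k_5 P_5(t) \\
\frac{dP_6(t)}{dt} &= a_{63} k_3 P_3(t) - k_6 P_6(t)
\end{aligned}
\tag{11.1}
$$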


Note that we don’t list plant P uptake in 
Equation (11.1), because we treat it as a soil P flux. 
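We don’t use $a_{p1}$ in Equation (11.1), because we 
can estimate it directly from $a_{21}$ and $a_{31}$ by the 
following equation: 

$$a_{p1} = 1 - a_{21} - a_{31} \tag{11.2}$$

We can summarize the set of balance Equations 
(11.1) by the following first-order matrix 
equation: 

$$\frac{dP(t)}{dt} = A K P(t) \tag{11.3}$$

where A, K, and P(t) are the matrices given by 

$$
A = \begin{pmatrix}
-1 & a_{12} & a_{13} & 1 & 0 & 0 \\
a_{21} & -1 & 0 & 0 & 1 & 0 \\
a_{31} & 0 & -1 & 0 & 0 & 1 \\
0 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 - a_{12} & 0 & 0 & -1 & 0 \\
0 & 0 & 1 - a_{13} & 0 & 0 & -1
\end{pmatrix}
$$

$$
K = \mathrm{diag}(k) = \begin{pmatrix}
k_1 & 0 & 0 & 0 & 0 & 0 \\
0 & k_2 & 0 & 0 & 0 & 0 \\
0 & 0 & k_3 & 0 & 0 & 0 \\
0 & 0 & 0 & k_4 & 0 & 0 \\
0 & 0 & 0 & 0 & k_5 & 0 \\
0 & 0 & 0 & 0 & 0 & k_6
\end{pmatrix}
$$

$$
P(t) = \begin{pmatrix}
P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \\ P_5(t) \\ P_6(t)
\end{pmatrix}
$$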

Matrix A gives the transfers of P between the 
individual P pools, as described by the arrows in 
Figure 11.1. The elements ($a_{ij}$) are the P transfer 
coefficients, representing the fraction of P entering 
the ith (row) pool from the jth (column) pool. $a_{52}$ 
is calculated as $1 - a_{12}$; $a_{63}$ is calculated as $1 - a_{13}$; 
$a_{14}$, $a_{25}$, and $a_{36}$ are fixed at 1.0. K is a 6 × 6 diagonal 
matrix representing the release rates of the six soil P 
pools (units: g P g⁻¹ P d⁻¹; for convenience, g g⁻¹ 
d⁻¹ is used in the following), i.e., the fraction 
of each soil P pool that leaves the pool per day. P(t) 
describes the sizes of the soil P pools at time t. 
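
As a minimal sketch of how the matrix equation (11.3) can be evaluated numerically, the Python snippet below builds A and K from the Honouliuli estimates in Table 11.1 and steps the six pools forward with a simple Euler scheme. The initial pool sizes, the one-day time step, the 45-day run length, and all function names are illustrative assumptions; they are not taken from Hou et al. (2019), whose own code is in Matlab.

```python
# Honouliuli maximum likelihood estimates from Table 11.1
k = [0.04, 0.022, 0.044, 0.00193, 0.0077, 0.0062]  # k1..k6 (g g^-1 d^-1)
a21, a31 = 0.51, 0.33   # labile P -> non-occluded Po, secondary mineral P
a12, a13 = 0.97, 0.23   # non-occluded Po, secondary mineral P -> labile P
ap1 = 1.0 - a21 - a31   # labile P -> plants, Equation (11.2)

# Transfer matrix A: row i receives from column j (pools P1..P6)
A = [
    [-1.0, a12,       a13,       1.0,  0.0,  0.0],
    [a21,  -1.0,      0.0,       0.0,  1.0,  0.0],
    [a31,  0.0,       -1.0,      0.0,  0.0,  1.0],
    [0.0,  0.0,       0.0,       -1.0, 0.0,  0.0],
    [0.0,  1.0 - a12, 0.0,       0.0,  -1.0, 0.0],
    [0.0,  0.0,       1.0 - a13, 0.0,  0.0,  -1.0],
]


def dP_dt(P):
    """Right-hand side of dP/dt = A K P with K = diag(k)."""
    outflow = [k[j] * P[j] for j in range(6)]  # K P: P leaving each pool per day
    return [sum(A[i][j] * outflow[j] for j in range(6)) for i in range(6)]


def simulate(P0, days, dt=1.0):
    """Forward Euler integration that also accumulates plant P uptake."""
    P, uptake = list(P0), 0.0
    for _ in range(int(days / dt)):
        uptake += ap1 * k[0] * P[0] * dt   # flux from labile P to plants
        rate = dP_dt(P)
        P = [P[i] + rate[i] * dt for i in range(6)]
    return P, uptake


# Hypothetical initial pool sizes (mg P kg^-1), for illustration only
P0 = [30.0, 300.0, 400.0, 200.0, 250.0, 650.0]
P_end, uptake = simulate(P0, days=45)
print("pools after 45 d:", [round(p, 1) for p in P_end])
print("plant uptake:", round(uptake, 3))
print("loss of total soil P:", round(sum(P0) - sum(P_end), 3))
```

Because every column of A except the first sums to zero, the only net loss from the soil is the plant uptake flux, so the last two printed numbers should agree. This is the same bookkeeping the chapter uses to infer plant P uptake from the decrease in total soil P.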


Model Validation and Data Assimilation 


We first validate the matrix P model with the eight 
datasets of soil P pool measurements in Guo et al. 
(2000). We use the same parameter values for sim- 
ulating P pools of all the eight soils. In general, the 
soil P model can simulate temporal changes in soil 
P pools reasonably well, with a better performance 
for labile P, secondary mineral P, and occluded P 
than for non-occluded Po and primary mineral P. 
We then use a data assimilation approach to 
estimate the values of parameters describing soil P 


dynamics. The approach, known as the Metropolis- 
Hastings algorithm, is described in more detail in 
the lectures and practice in unit 6 on terrestrial C 
dynamics (see Chapter 22). We run five replicate 
chains of 500,000 iterations each for every soil 
to examine the convergence of the parameters. 
We test the convergence of the sampling chains 
by the Gelman-Rubin (G-R) diagnostic method 
to ensure that the within-run variation 
is roughly equal to the between-run variation. 
The Gelman-Rubin method is described in detail 
in Chapter 22. 
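
The two ingredients named above, Metropolis-Hastings sampling and the Gelman-Rubin check, can be sketched on a deliberately small problem. The example below estimates a single turnover rate of a hypothetical one-pool model from synthetic observations. The model, the Gaussian random-walk proposal, the uniform prior bounds, the chain lengths, and all function names are assumptions made for illustration; this is not the configuration used in Hou et al. (2019).

```python
import math
import random
import statistics

P0 = 100.0  # known initial pool size of the toy model (arbitrary units)


def log_likelihood(k, times, obs, sigma):
    """Gaussian log-likelihood for a one-pool model P(t) = P0 * exp(-k * t)."""
    sse = sum((o - P0 * math.exp(-k * t)) ** 2 for t, o in zip(times, obs))
    return -sse / (2.0 * sigma ** 2)


def metropolis_hastings(times, obs, sigma, n_iter, rng,
                        k_min=0.0, k_max=0.2, step=0.002):
    """Random-walk Metropolis-Hastings with a uniform prior on [k_min, k_max]."""
    k = rng.uniform(k_min, k_max)
    ll = log_likelihood(k, times, obs, sigma)
    chain = []
    for _ in range(n_iter):
        k_new = k + rng.gauss(0.0, step)
        if k_min <= k_new <= k_max:  # zero prior probability outside the bounds
            ll_new = log_likelihood(k_new, times, obs, sigma)
            if rng.random() < math.exp(min(0.0, ll_new - ll)):
                k, ll = k_new, ll_new
        chain.append(k)
    return chain


def gelman_rubin(chains):
    """Potential scale reduction factor for equally long chains of one parameter."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean(statistics.variance(c) for c in chains)
    between = n * statistics.variance(means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Synthetic observations generated with a known turnover rate
rng = random.Random(1)
times = [10.0 * i for i in range(8)]
true_k, sigma = 0.04, 2.0
obs = [P0 * math.exp(-true_k * t) + rng.gauss(0.0, sigma) for t in times]

# Four replicate chains; the first half of each is discarded as burn-in
n_iter = 5000
chains = [metropolis_hastings(times, obs, sigma, n_iter, rng)[n_iter // 2:]
          for _ in range(4)]
r_hat = gelman_rubin(chains)
posterior_mean = statistics.fmean(k for c in chains for k in c)
print("G-R statistic:", round(r_hat, 3))
print("posterior mean of k:", round(posterior_mean, 4), "(true value", true_k, ")")
```

A G-R statistic close to 1 indicates that the within-chain and between-chain variances are similar, which is the convergence criterion described in the text.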

After data assimilation, the soil P model can 
simulate well temporal changes in P pools of all 
the eight soils. Results on two soils are shown in 
Figure 11.2. Relationships between the measured 
and modeled soil P pools were mostly significant 
(P < 0.05), with R² values generally larger for 
labile P (mean 0.90), secondary mineral P (0.82), 
and occluded P (0.87) than for non-occluded Po 
(0.43) and primary mineral P (0.50). The rela- 
tively poor simulation of non-occluded Po was 
probably due to its relatively large measurement 
errors and dynamic nature. 


NEW KNOWLEDGE EMERGING FROM DATA 
ASSIMILATION WITH THE MATRIX MODEL 


Soil P Dynamics Quantified by Data Assimilation 


Maximum likelihood estimates of the turnover rates 
of all soil P pools, together with their uncertainties, 
were obtained, including those of soil occluded Pi and 
occluded Po, which are rarely quantified in mea- 
surements. Estimated parameter values for a 
slightly weathered soil and a strongly weathered 
soil are shown in Table 11.1. Turnover rates of 
soil inorganic P pools generally decreased in the 
following order: labile P > secondary mineral P 
> occluded Pi (Table 11.1). The turnover rate of 
soil non-occluded Po was faster than that of soil 
occluded Po (Table 11.1). 

The turnover rates of soil P pools (Table 
11.1) were generally comparable with the values 
reported in previous studies using isotopic and 
spectroscopic measurements in the laboratory. 
However, the turnover rates of soil P pools here 
could be higher than those in the field, because the 
datasets used for data assimilation were derived 
from an experiment in a greenhouse environment, 
where plant P uptake and soil P depletion could be 
faster than in the field. 


Figure 11.2. Observed vs. simulated temporal changes in P pools of two soils. The two soils are Honouliuli and Paaloa, 
which are a typical Vertisol (slightly weathered) and a typical Oxisol (strongly weathered), respectively. Derived from Hou et al. (2019). 


Rates of transformations among all major P 
pools were also estimated (Table 11.1). These esti- 
mates can convey deep insights into soil P dynamics 
and soil P bioavailability. For example, the propor- 
tions of labile P flowing to secondary mineral P 
(mean 0.35) and non-occluded Po (0.52) were on 


average larger than that flowing to plants (0.13) 
(Table 11.1). This result suggests that soil secondary 
minerals and microbes were stronger competitors 
for soil labile P than plants. Both the turnover rate 
of labile P and the transfer coefficients related to labile 
P differed among soils (Table 11.1), suggesting that 
labile P dynamics varied among soils. This result 
suggests that not only the amount but also the 
dynamics of soil labile P control soil P availability. 


TABLE 11.1 
Physicochemical properties and maximum likelihood estimates of model parameters describing P dynamics 
of two soils 

Parameter | Unit | Honouliuli | Paaloa 
Soil order | | Vertisols | Oxisols 
Weathered extent | | Slightly | Strongly 
Total P concentration at the start of experiment | mg kg⁻¹ | 1840 | 596 
0.5 M NaHCO₃ extractable P concentration | mg kg⁻¹ | 26.3 | 1.1 
pH in water | | 7.26 | 5.05 
Organic C concentration | g kg⁻¹ | 16.6 | 40 
Exchangeable cation concentration | cmol_c kg⁻¹ | 30.6 | 5.5 
Acid ammonium oxalate extracted Fe concentration | g kg⁻¹ | 3.4 | 7.48 
Acid ammonium oxalate extracted Al concentration | g kg⁻¹ | 1.43 | 2.98 
Sand content | g kg⁻¹ | 56.5 | 53.8 
Silt content | g kg⁻¹ | 363.6 | 205.4 
Clay content | g kg⁻¹ | 580 | 740.8 
Turnover rate of labile P ($k_1$) | g g⁻¹ d⁻¹ | 0.04 | 0.048 
Turnover rate of non-occluded Po ($k_2$) | g g⁻¹ d⁻¹ | 0.022 | 0.073 
Turnover rate of secondary mineral P ($k_3$) | g g⁻¹ d⁻¹ | 0.044 | 0.012 
Turnover rate of primary mineral P ($k_4$) | g g⁻¹ d⁻¹ | 0.00193 | 0.00015 
Turnover rate of occluded Po ($k_5$) | g g⁻¹ d⁻¹ | 0.0077 | 0.0067 
Turnover rate of occluded Pi ($k_6$) | g g⁻¹ d⁻¹ | 0.0062 | 0.009 
Coef. of transfer from labile P to non-occluded Po ($a_{21}$) | Unitless | 0.51 | 0.53 
Coef. of transfer from labile P to secondary mineral P ($a_{31}$) | Unitless | 0.33 | 0.38 
Coef. of transfer from labile P to plant ($a_{p1}$) | Unitless | 0.16 | 0.09 
Coef. of transfer from non-occluded Po to labile P ($a_{12}$) | Unitless | 0.97 | 0.23 
Coef. of transfer from non-occluded Po to occluded Po ($a_{52}$) | Unitless | 0.03 | 0.77 
Coef. of transfer from secondary mineral P to labile P ($a_{13}$) | Unitless | 0.23 | 0.72 
Coef. of transfer from secondary mineral P to occluded Pi ($a_{63}$) | Unitless | 0.77 | 0.28 
Proportion of occluded P in organic form (OP,) | Unitless | 0.28 | 0.02 

Derived from Hou et al. (2019) 


The estimated model parameters from data 
assimilation provide information about the data- 
sets and the model in two other aspects. Firstly, 
posterior distributions of parameters can tell how 
well model parameters were constrained by the 
datasets. Secondly, relationships among model 
parameters can reflect relationships defined by the 
model structure, correlations between the soil P 
pools, errors, or any combination of all three. 


Soil P Dynamics in Relation to Other Ecosystem 
Properties 


Soil P dynamics differed between the slightly 
and the strongly weathered soils (Table 11.1). 
The turnover rate of secondary mineral P was 
nearly four times as high in the slightly weathered 
soil (0.044 g g⁻¹ d⁻¹) as in the strongly 
weathered soil (0.012 g g⁻¹ d⁻¹) (Table 11.1). 
The proportion of labile P flowing to plants was 
higher in the slightly weathered soil (0.16) than 
in the strongly weathered soil (0.09); the opposite 
was true for the proportion of labile P flowing 
to secondary mineral P (slightly: 0.33; strongly: 
0.38) (Table 11.1). 

The proportion of labile P flowing to plants 
increased with soil pH; correspondingly, the pro- 
portion of labile P flowing to secondary min- 
eral P decreased with soil pH (Figure 11.3c). 
Relationships of soil organic C concentration with 


model parameters were generally opposite to those 
of soil pH (Figure 11.3). These results suggest the 
regulation of plant and soil microbial competition 
for labile P by soil pH and organic C concentration. 
The proportion of labile P flowing to secondary 
mineral P decreased with increasing soil pH. This 
was probably because of a decrease in soil P sorption 
capacity with increasing soil pH. This mechanism 
also explains the increase in turnover rate of sec- 
ondary mineral P with increasing soil pH (Figure 
11.3a). Increase in the proportion of labile P flow- 
ing to secondary mineral P with increasing soil 
organic C concentration (Figure 11.3d) may be 
attributable to the sorption of P by organic-metal 
complexes in soils. This mechanism additionally 
explains the decrease in turnover rate of secondary 
mineral P with increasing soil organic C 
concentration (Figure 11.3b). 


Figure 11.3. Soil P dynamics in relation to soil pH and organic C concentration. Turnover rate of secondary mineral P vs. 
(a) soil pH (R² = 0.99, P < 0.001) and (b) soil organic C concentration (R² = 0.87, P = 0.001). (c) Coefficients of transfer from 
labile P to plants (green; R² = 0.53, P = 0.042) and secondary mineral P (red; R² = 0.71, P = 0.009) vs. soil pH. (d) Coefficients 
of transfer from labile P to plants (green; R² = 0.56, P = 0.034) and secondary mineral P (red; R² = 0.38, P = 0.106) vs. soil 
organic C concentration. Derived from Hou et al. (2019). 


SUMMARY 


Despite increasing research efforts, there are still 
some uncertainties in the structure of soil P models 
and much larger uncertainties in the parameter 
values of soil P models. These uncertainties can 
potentially be reduced by the adoption of matrix and 
data assimilation approaches in soil P modeling. The 
matrix representation can make it easier to depict, 
understand, and compare soil P models than 
the traditional representation as a set of balance 
equations. Matrix approaches can also reduce the 
computational costs of models by enabling 
semi-analytical spin-up and make it computationally 
more feasible to assimilate soil P observations into 
models. An example study was given to show how 
to build a soil P model in matrix form, and how 
assimilating soil P observations into the matrix 
model can yield insights into soil P dynamics and 
availability. Overall, this chapter shows that assimi- 
lating soil P observations into a matrix model of 
soil P dynamics can improve our understanding of 
soil P dynamics and availability. 


SUGGESTED READING 


Hou, E., X. Lu, L. Jiang, D. Wen and Y. Luo (2019). 
Quantifying soil phosphorus dynamics: a data 
assimilation approach. Journal of Geophysical Research: 
Biogeosciences 124: 2159-2173. 


QUIZZES 


1. Why is a unified matrix equation preferred over 
traditional balance equations in soil P studies? 


2. Was turnover rate of soil labile P constant across 
soils? 

3. Which model parameter is soil labile P pool size 
most sensitive to? 


4. For what purpose was data assimilation used in 
the example study? 


This practice helps you understand diagnostic variables 
in biogeochemistry matrix models. The key 
diagnostic variables include carbon storage capac- 
ity, storage potential, residence time, and input. We 
use a TECO matrix model to demonstrate how to 
obtain diagnostic variables from carbon balance 
equations. We verify the theory that carbon 
storage capacity represents an attractor of carbon 
storage dynamics. We also examine how 
changes in carbon turnover rate and input allocation 
fraction affect the carbon residence time 
and therefore the carbon storage capacity. Carbon 
residence time and carbon input are two essential 
diagnostics to characterize the steady state carbon 
storage in land carbon cycle modeling. 


MOTIVATION OF THE UNCERTAINTY 
DIAGNOSTICS 


Land carbon cycle models are frequently used to 
estimate biosphere-atmosphere feedback. However, 
different models often provide quite divergent 
predictions in land carbon uptake and storage as 
revealed via model intercomparison projects. Great 
efforts have been made to understand differences 
among models. Even so, we still have difficulties in 
identifying causes of model differences. 

To improve understanding of why models 
behave differently, we need effective diagnostic 
tools. This is one of the motivations for develop- 
ing the matrix approach. The traceability analysis 
with the matrix approach offers a framework to 
understand each traceable component first and 
then assemble them together to analyze the carbon 
storage dynamics. 

The matrix approach enables decomposition of 
modeled carbon storage to a few traceable compo- 
nents according to the mathematical properties of 
the matrix equation. In this way, we can conduct 
traceability analysis (see unit 5) to understand the 
causes of model uncertainty. 


THE MATHEMATICAL FOUNDATION OF THE 
DIAGNOSTICS OF LAND CARBON CYCLE 
MODELS 


Previous chapters have demonstrated that the carbon 
storage dynamics in most land carbon cycle 
models can be formulated in a matrix form (Luo 
et al. 2017; Chapter 1): 

$$\frac{dX(t)}{dt} = B u(t) + G X(t) \tag{12.1}$$

where X(t), as a vector, represents carbon storage 
in multiple pools; u(t), as a scalar, represents the 
amount of carbon input from net primary production 
(NPP, i.e., photosynthesis minus autotrophic 
respiration); B, as a vector, is the carbon partitioning 
from NPP to multiple pools; and G, as a matrix, 
indicates the carbon transfer network among pools. 

Equation (12.1) can be reformulated to a diagnostic 
format: 

$$X(t) = X_c(t) + X_p(t) \tag{12.2}$$

where $X_c$ is carbon storage capacity and $X_p$ is carbon 
storage potential. The carbon storage capacity can 
be further decomposed into ecosystem residence 
time ($\tau_E$) and carbon input (u): 

$$X_c(t) = \tau_E u(t) \tag{12.3}$$

where ecosystem residence time is defined by: 

$$\tau_E = -G^{-1} B \tag{12.4}$$

The carbon storage potential is the product 
of the inverse of matrix G and the carbon storage 
growth rate: 

$$X_p(t) = G^{-1} \frac{dX(t)}{dt} \tag{12.5}$$

In this practice, we are going to focus on the 
carbon storage capacity ($X_c$), carbon storage potential 
($X_p$), carbon residence time ($\tau_E$) and carbon 
input (u), which are the most important diagnostics 
in the matrix approach. 

Equations 12.3-12.5 show that the three terms 
carbon storage capacity ($X_c$), carbon storage potential 
($X_p$), and carbon residence time ($\tau_E$) are related 
to the matrix G. Therefore, one of the most critical 
steps to calculate the above diagnostics is to find 
the matrix G, which equals $A\xi(t)K$. 

Exercise 1 and Exercise 2 are to identify diagnostic 
variables from the TECO and CLM5 matrix models. 


EXERCISE 1 


The matrix equation of the TECO model can be 
written into one matrix equation: 

$$\frac{dX(t)}{dt} = B u(t) + A \xi(t) K X(t) \tag{12.6}$$

u is the C input scalar (gC m⁻² s⁻¹). B represents 
the C allocation fraction vector (unitless). 
A is the partitioning coefficient matrix (unitless). 
K is a diagonal turnover rate matrix (s⁻¹). 
X is the C pool size vector (gC m⁻²). 

Ex 1.1 Find the matrix G of Equation 12.6. 
This matrix was defined in Equation 12.1: G = ... 

Ex 1.2 Write down the diagnostic form of 
Equation 12.6: X(t) = ... 

Ex 1.3 Identify the residence time, C input, 
C storage capacity and C storage potential from 
the diagnostic form. 

Residence time: $\tau_E$ = ... 
C input: u = ... 
C storage capacity: $X_c$ = ... 
C storage potential: $X_p$ = ... 


EXERCISE 2 (optional) 


The matrix equation of the CLM5 vegetation C 
cycle model can be written in the form of one 
matrix equation: 

$$\frac{dX(t)}{dt} = B u(t) + A_{ph}(t) K_{ph}(t) X(t) + A_{gm} K_{gm} X(t) + A_{fi} K_{fi}(t) X(t) \tag{12.7}$$

u is the C input scalar (gC m⁻² s⁻¹). B represents 
the C allocation fraction vector (unitless). 
A is the partitioning coefficient matrix (unitless). 
K is a diagonal turnover rate matrix (s⁻¹). 
X is the C pool size vector (gC m⁻²). The subscripts 
ph, gm, and fi represent the partitioning coefficient 
and turnover rate related to phenology, 
gap mortality, and fire processes. 

Ex 2.1 Find the matrix G of Equation 12.7: 
G = ... 

Ex 2.2 Write down the diagnostic form of 
Equation 12.7: X(t) = ... 

Ex 2.3 Identify residence time, C input, C 
storage capacity, and C storage potential from 
the diagnostic form. 

Residence time: $\tau_E$ = ... 
C input: u = ... 
C storage capacity: $X_c$ = ... 
C storage potential: $X_p$ = ... 


CARBON STORAGE CAPACITY AND CARBON 
STORAGE POTENTIAL 


The carbon storage capacity and carbon storage 
potential are two important diagnostics. The 
carbon storage capacity represents the maximal 
amount of carbon that a land ecosystem can store. 
The carbon storage potential represents the difference 
between carbon storage capacity and current 
carbon storage. The two diagnostics are the first 
tier variables to define land carbon dynamics for 
any modeling study. The carbon storage capacity 
is the attractor and the sign of the carbon storage 
potential represents the direction. That is, the carbon 
storage will ultimately reach carbon storage 
capacity regardless of initial values, carbon inputs 
and other model parameters. Exercise 3 will allow 
us to verify this notion. 


EXERCISE 3 


Run the TECO matrix model (Ex 3.1), then 
change the initial value (Ex 3.2) and input (Ex 
3.3). Learn whether carbon storage changes 
to approach the carbon storage capacity. Learn 
how carbon storage potential changes. 

Ex 3.1. Follow instructions to run the TECO 
matrix model in CarboTrain: 
a. Select Unit 3 
b. Select Exercise 3 
c. Select Default 
d. Select Set Output Folder 
e. Open Source Code 
f. Read lines 10-44, get familiar with 
parameters, input, and initial value. The 
file test_p3.py can be edited at 
this step. 
g. Run Exercise 
h. Check results in Output Folder. Time-dependent 
variation in carbon input, each 
pool size, residence time, total ecosystem 
carbon storage, total ecosystem carbon 
storage capacity, and carbon storage potential 
appear in the output file output.xls. 
Figures are in results.png. 

Questions: Does carbon storage change 
towards carbon storage capacity? How does 
carbon storage potential change? 

Ex 3.2. Try different initial values 
a. Repeat Ex 3.1, but select “Change initial 
pool size I” instead of “Default”. 
Open source code and change the initial 
value to (1000.0 20000.0 50.0 3000.0 
200.0 15000.0 40000.0) at line 44 of 
test_p3.py. Run Exercise and check 
the total ecosystem carbon storage and 
total ecosystem carbon storage capacity 
in results.png. 
b. Repeat Ex 3.1, but select “Change initial 
pool size II” instead of “Default”. 
Open source code and change the initial 
value to (280.0 5000.0 15.0 800.0 
50.0 4000.0 10000.0) at line 44 of 
test_p3.py. Run Exercise and check 
the total ecosystem carbon storage and 
total ecosystem carbon storage capacity 
in results.png. 

Questions: If the initial pool size changes, does 
carbon storage change towards carbon storage 
capacity? Does the carbon storage capacity 
depend on the initial value? How does carbon 
storage potential change? 

Ex 3.3. Try different C inputs 
a. Repeat Ex 3.1, but select “Change carbon 
input I” instead of “Default”. Open 
source code and change the C input to 
0.00001123 at line 38 of test_p3.py. 
Run Exercise and check the total ecosystem 
carbon storage and total ecosystem 
carbon storage capacity in results.png. 
b. Repeat Ex 3.1, but select “Change carbon 
input II” instead of “Default”. 
Open source code and change the 
C input to 0.00004490 at line 38 of 
test_p3.py. Run Exercise and check 
the total ecosystem carbon storage and 
total ecosystem carbon storage capacity 
in results.png. 


These questions are critical to the conceptual 
foundation of the carbon storage capacity and 
carbon storage potential. Results from Exercise 3 
guide you to understand why carbon storage 
capacity and carbon storage potential are important 
diagnostics for carbon cycle modeling and 
how the definitions of the two variables reflect the 
inherent property of the land carbon cycle. 
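
The attractor behaviour explored in Exercise 3 can also be sketched independently of CarboTrain. The snippet below is a minimal illustration, not the TECO code in test_p3.py. It assumes a hypothetical three-pool model written as dX/dt = Bu + GX with G = AξK, a transfer matrix A with -1 on its diagonal, a constant environmental scalar ξ, and daily rather than per-second rates. Under that sign convention the residence time is -G⁻¹B, the capacity is the residence time multiplied by u, and the potential is G⁻¹ dX/dt, so that storage equals capacity plus potential. All numerical values and function names are assumptions.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


# Hypothetical three-pool model (plant, litter, soil); all values are illustrative
u = 2.0                       # carbon input (gC m^-2 d^-1)
B = [1.0, 0.0, 0.0]           # allocation of input to the pools
k = [0.002, 0.01, 0.0002]     # turnover rates (d^-1)
A = [[-1.0, 0.0, 0.0],        # transfer coefficients with -1 on the diagonal
     [1.0, -1.0, 0.0],        # all plant turnover enters litter
     [0.0, 0.4, -1.0]]        # 40% of litter turnover enters soil
xi = 1.0                      # environmental scalar, held constant for simplicity
n = len(k)
G = [[A[i][j] * xi * k[j] for j in range(n)] for i in range(n)]

tau_E = solve(G, [-b_i for b_i in B])   # residence time: -G^{-1} B (d)
X_c = [t * u for t in tau_E]            # storage capacity: tau_E * u


def growth_rate(X):
    """dX/dt = B u + G X."""
    return [B[i] * u + sum(G[i][j] * X[j] for j in range(n)) for i in range(n)]


def storage_potential(X):
    """X_p = G^{-1} dX/dt, so that X = X_c + X_p."""
    return solve(G, growth_rate(X))


def simulate(X0, days, dt=1.0):
    """Forward Euler integration of the pool sizes."""
    X = list(X0)
    for _ in range(int(days / dt)):
        rate = growth_rate(X)
        X = [X[i] + rate[i] * dt for i in range(n)]
    return X


for X0 in ([0.0, 0.0, 0.0], [3000.0, 500.0, 9000.0]):
    X_end = simulate(X0, days=100_000)
    print("start", sum(X0), "-> end", round(sum(X_end), 2),
          "| capacity", round(sum(X_c), 2),
          "| initial potential", round(sum(storage_potential(X0)), 2))
```

Starting below or above the capacity changes the sign of the initial potential but not the state that the simulation approaches. Changing u or an element of k shifts the capacity itself through the input or the residence time.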


RESIDENCE TIME AND CARBON INPUT 


Now let us further explore the carbon storage 
capacity, which is described by two additional 
traceable components: residence time and car- 
bon input. Climate change causes changes in 


EXERCISE 4 


Run the TECO matrix model, and make follow- 
ing changes in parameters or carbon input (Ex 
4.1-4.3). Observe how parameters influence 
the carbon storage capacity. 

Ex 4.1 Try a different turnover rate. 


a. Repeat Ex 3.1, but Select “Exercise 4”, 
and “Low foliage turnover”. Open 
source code and change the foliage 
turnover rate to 8.8 x 107*, which is 
the first element of “temp” at line 26. 
Run Exercise and check the total eco- 
system carbon storage capacity, residence 
time and carbon input at steady state in 
output.xls and results.png. 


b. Repeat Ex 3.1, but Select “Exercise 4”, 
and “Low passive soil turnover”. Open 
source code and change the passive soil 
turnover rate to 7.739 x 107’, which is 
the last element of “temp” at line 26. 


Questions: If the carbon storage capac- 
ity changes, does carbon storage still change 
towards carbon storage capacity? Does the car- 
bon storage capacity depend on the C input? 
How does carbon storage potential change? 
Would the carbon storage always equal the car- 
bon storage capacity at steady-state? What does 
zero carbon storage potential stand for? 


carbon cycling, usually via changes in residence 
time and/or carbon input. For example, rising 
atmospheric CO, concentration usually enhances 
carbon input (u). Rising air temperature usually 
increases soil carbon decomposition and thus 
turnover rate of soil pools (K). When a forest is 
converted to cropland under land use change, the 
carbon allocation to woody tissue is usually set 
to zero, leading to changes in the carbon allo- 
cation fraction (B). Thus, changes in ecosystem 
carbon storage in response to various climate 
change factors can be explained by diagnostics 
related to residence time and carbon input in the 
matrix approach. Exercise 4 will let us explore how 
changes in parameter values related to residence 


Run Exercise and check the total eco- 
system carbon storage capacity, residence 
time and carbon input at steady state in 
output .xls and results .png. 


Questions: Is the carbon storage capacity in Ex 
4.la or Ex 4.1b lower or higher than that in 
the default model (Ex 3.1)? Can the lower or 
higher carbon storage capacity be attributed 
to residence time or carbon input? How do 
changes in turnover of foliage or passive soil 
carbon impact residence time or carbon input? 
Which impact is stronger? Is it the impact of 
lower foliage turnover rate or the impact of 
lower passive soil turnover rate? Why? 


Ex 4.2 Try different allocation fractions. 

a. Repeat Ex 3.1, but Select “Exercise 4”, 
and “High allocation to foliage”. Open 
source code and change the allocation 
fraction to (0.55, 0.45...), which are the 
first two elements of “B” at line 10. Run 
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Exercise and check the total ecosystem 
carbon storage capacity, residence time 
and carbon input at steady state in out- 
put.xls and results.png. 

b. Repeat Ex 3.1, but Select “Exercise 4”, 
and “High allocation to wood”. Open 
source code and change the allocation 
fraction to (0.2, 0.8...), which are the 
first two elements of “B” at line 10. Run 
Exercise and check the total ecosys- 
tem carbon storage capacity, residence 
time and carbon input at steady state in 
output.xls and results.png. 


Questions: Is the carbon storage capacity in Ex 
4.2a or Ex 4.2b lower or higher than that in 
the default model (Ex 3.1)? Can the lower or 
higher carbon storage capacity be attributed 
to residence time or carbon input? How do 
changes in allocation fraction impact residence 
time or carbon input? Why? 


Ex 4.3 Try multiple changes of parameters in 
different C input and turnover rate. 

a. Repeat Ex 3.1, but Select “Exercise 4”, 

and “Multiple Changes 1”. Open source 

code and change the C input to 1.123 x 

107" at line 26 and 38, and change the 

slow and passive soil turnover rate to 

2.99 x 107% and 5.159 x 107, which 


time or carbon input result in changes in the car- 
bon storage capacity. 

Diagnostic capability is one of the most impor- 
tant benefits offered by the matrix approach. This 
practice only provides you with simple cases to 
understand the terrestrial ecosystem carbon cycle. 
To understand the land carbon cycle at the earth 
system scale, we need to analyze wider ranges 
of spatial and temporal variations from multiple 
models in response to rising atmospheric CO, con- 
centration, and climate warming (Lu et al. 2018). 


are the last two elements of “temp” at line 
25. Run Exercise and check the total eco- 
system carbon storage capacity, residence 
time and carbon input at steady state in 
output.xls and results.png. 

b. Repeat Ex 3.1, but Select “Exercise 
4”, and “Multiple Changes 11”. Open 
source code and change the C input 
to 4.49 x 10-5 at line 26 and 38, and 
change the slow and passive soil turn- 
over rate to 2.69 x 107* and 4.643 x 
107%, which are the last two elements 
of “temp” at line 25. Run Exercise and 
check the total ecosystem carbon stor- 
age capacity, residence time and carbon 
input at steady state in output.xls 
and results.png. 


Questions: What are the common features or 
differences in carbon storage dynamics and 
carbon storage capacity between Ex 4.3a and 
Ex 4.3b? Can you see how residence time and 
carbon input cause differences in carbon stor- 
age capacity? What causes these differences in 
residence time or carbon input? 
Understanding how carbon storage capacity 
changes with carbon residence time and car- 
bon input changes is essential in diagnostics 
for carbon cycle modeling. This will be further 
explored in unit 5 on traceability analysis. 
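
The ideas behind Exercises 3 and 4 can be sketched outside the TECO code. The block below is a minimal illustration, not the TECO model: it assumes a hypothetical three-pool chain with made-up turnover rates `K`, transfer fractions `TRANSFER` and carbon input `U`. It computes the carbon storage capacity as the steady state, derives the residence time as capacity divided by input, and checks that storage approaches the same capacity from two very different initial pool sizes.

```python
# Hypothetical 3-pool chain (plant -> litter -> soil); all numbers are illustrative
# assumptions and are NOT the TECO parameter values used in the exercises.
K = [0.5, 0.2, 0.01]      # turnover rate of each pool (1/yr)
TRANSFER = [0.6, 0.3]     # fraction of the outflow of pool i that enters pool i+1


def storage_capacity(u):
    """Steady-state pool sizes, found by setting inflow equal to outflow."""
    x0 = u / K[0]
    x1 = TRANSFER[0] * K[0] * x0 / K[1]
    x2 = TRANSFER[1] * K[1] * x1 / K[2]
    return [x0, x1, x2]


def simulate(x_init, u, years, dt=0.01):
    """Forward-Euler integration of the pool dynamics under constant input u."""
    x = list(x_init)
    for _ in range(round(years / dt)):
        out = [K[i] * x[i] for i in range(3)]
        x = [
            x[0] + dt * (u - out[0]),
            x[1] + dt * (TRANSFER[0] * out[0] - out[1]),
            x[2] + dt * (TRANSFER[1] * out[1] - out[2]),
        ]
    return x


U = 1000.0                                   # carbon input (g C m-2 yr-1)
capacity = sum(storage_capacity(U))          # total carbon storage capacity
residence_time = capacity / U                # ecosystem residence time (yr)

# Two very different initial pool sizes approach the same capacity.
from_empty = sum(simulate([0.0, 0.0, 0.0], U, years=2000))
from_large = sum(simulate([5000.0, 5000.0, 50000.0], U, years=2000))

# Doubling the input doubles the capacity; the residence time is unchanged.
capacity_doubled_input = sum(storage_capacity(2 * U))

print(f"capacity = {capacity:.1f}, residence time = {residence_time:.1f} yr")
print(f"storage after 2000 yr: {from_empty:.1f} (empty start), {from_large:.1f} (large start)")
print(f"capacity with doubled input = {capacity_doubled_input:.1f}")
```

In this sketch the carbon storage potential is simply `capacity` minus the current storage, so it shrinks to zero as the simulation approaches steady state.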
The matrix equation model for the terrestrial 
carbon dynamics is a system of nonautonomous 
ODEs, which naturally inherits the mathematical 
difficulties in the solution process and stability 
studies of its equilibria. This chapter introduces the 
mathematical properties of this matrix equation 
model. In particular, we will study: (1) the analyti- 
cal solution of the matrix equation model, provid- 
ing a concrete 3-pool terrestrial carbon dynamics 
to demonstrate the solution process; and (2) the 
stability analysis of the matrix equation model. 


INTRODUCTION 


We consider the following system of nonautono- 
mous ordinary differential equations (ODEs): 


$$X' = \xi(t)\,A\,C\,X + B\,u(t), \qquad X(0) = X_0 \tag{13.1}$$


DOI: 10.1201/9780429155659-17 


where $X(t)$ is a vector of carbon pool sizes; $X_0$ is a vector of initial values of the carbon pools; $\xi(t)$ is an environmental scalar representing effects of temperature and moisture on the carbon transfer among pools; $A$ and $C$ are carbon transfer coefficients between plant, litter, and soil pools; $u(t)$ is the photosynthetically fixed carbon, usually estimated by canopy photosynthetic models; and $B$ is a vector of partitioning coefficients of the photosynthetically fixed carbon to plant pools. For background to this equation, see chapter 1. 

Our goal is to develop an analytical solu- 
tion of (13.1), and to understand and predict 
how the equilibrium state stability is impacted 
by various environmental scalar functions $\xi(t)$, 
transfer matrices A and C, and photosynthetic 
input u(t). 


ANALYTICAL SOLUTION 


In mathematical language, governing equation 
(13.1) is a system of non-homogeneous nonau- 
tonomous ODEs, and the derivation of its solu- 
tion is complicated. Therefore, instead of deriving 
the solution of (13.1) directly, we will start with 
studying the analytical solution of a scalar non- 
homogeneous nonautonomous ODE, and then 
extend it to the system situation. 


First Order Non-homogeneous Scalar Equation 


A first-order linear non-homogeneous scalar equa- 
tion has the following general form: 


$$x' + p(t)\,x = q(t) \tag{13.2}$$


A standard way to solve (13.2) is the integrat- 
ing factor method, which proceeds as follows. Let: 


$$P(t) = \int_{t_0}^{t} p(s)\,ds,$$

then the integrating factor is given by: 

$$\mu(t) = e^{P(t)}. \tag{13.3}$$


After multiplying both sides of (13.2) by the integrating factor (13.3), we get: 

$$e^{P(t)}\left(x' + p(t)\,x\right) = e^{P(t)} q(t),$$

$$e^{P(t)} x' + e^{P(t)} p(t)\,x = e^{P(t)} q(t). \tag{13.4}$$


Because $\left(e^{P(t)}\right)' = e^{P(t)} p(t)$, the product rule allows us to combine the left-hand side of (13.4) into one derivative: 

$$\left(e^{P(t)} x\right)' = e^{P(t)} q(t).$$


Therefore, integrating the entire equation gives 
us the analytical solution formula for the first- 
order linear non-homogeneous scalar equation 
(13.2): 


$$x(t) = e^{-P(t)}\left[\int_{t_0}^{t} e^{P(s)} q(s)\,ds + C\right], \tag{13.5}$$

where: 

$$C = x(t_0).$$

Let's take a look at the following straightforward example to get ourselves familiar with this solution procedure: 

$$x' - x = e^{2t}. \tag{13.6}$$

The integrating factor in this case is: 

$$\mu(t) = e^{\int -1\,ds} = e^{-t}. \tag{13.7}$$

Now, we multiply the ODE (13.6) by the integrating factor (13.7): 

$$\left(e^{-t} x\right)' = e^{t},$$

then integrate this equation to get: 

$$e^{-t} x = e^{t} + C,$$

therefore: 

$$x = e^{2t} + C e^{t},$$

where: 

$$C = x(0) - 1.$$


One-Pool Model 


Now, we consider a one-pool model, that is, the vector X in (13.1) becomes a scalar. Then (13.1) becomes a first-order linear non-homogeneous scalar equation (13.2) with 

$$p(t) = -\xi(t)\,A\,C, \qquad q(t) = B\,u(t),$$

where X(t), A, C and B are all scalars, instead of matrices. Applying the solution formula (13.5) to the one-pool terrestrial carbon cycle model (13.1), we have: 

$$X(t) = e^{-P(t)}\left[\int_{t_0}^{t} e^{P(s)} B\,u(s)\,ds + C\right],$$

where: 

$$P(t) = \int_{t_0}^{t} -\xi(s)\,A\,C\,ds, \qquad C = X(t_0).$$
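
The one-pool formula can be checked numerically. The sketch below is only an illustration under assumed values: the scalars `A`, `C`, `B`, the initial value `X0`, the environmental scalar `xi(t) = 1 + 0.5 sin t` and the constant input `u(t) = 2` are all hypothetical. It evaluates the integrating-factor solution with Simpson's rule and compares it with a direct Runge-Kutta integration of the same ODE.

```python
import math

# Hypothetical scalar "matrices" and initial value for a one-pool model.
A, C, B = -1.0, 0.5, 1.0
X0, T0 = 1.0, 0.0


def xi(t):
    """Assumed environmental scalar."""
    return 1.0 + 0.5 * math.sin(t)


def u(t):
    """Assumed (constant) carbon input."""
    return 2.0


def P(t):
    """Closed form of the integral of -xi(s)*A*C from T0 = 0 to t."""
    return -A * C * (t + 0.5 * (1.0 - math.cos(t)))


def x_formula(t, n=2000):
    """Integrating-factor solution; the integral is evaluated with Simpson's rule."""
    h = (t - T0) / n

    def integrand(s):
        return math.exp(P(s)) * B * u(s)

    total = integrand(T0) + integrand(t)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(T0 + i * h)
    return math.exp(-P(t)) * (total * h / 3 + X0)


def x_rk4(t, n=2000):
    """Direct fourth-order Runge-Kutta integration of X' = xi*A*C*X + B*u."""
    def rhs(s, x):
        return xi(s) * A * C * x + B * u(s)

    h, s, x = (t - T0) / n, T0, X0
    for _ in range(n):
        k1 = rhs(s, x)
        k2 = rhs(s + h / 2, x + h * k1 / 2)
        k3 = rhs(s + h / 2, x + h * k2 / 2)
        k4 = rhs(s + h, x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        s += h
    return x


difference = abs(x_formula(5.0) - x_rk4(5.0))
print(f"formula vs. RK4 at t = 5: difference = {difference:.2e}")
```

The two values should agree to within the accuracy of the two numerical schemes.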


Homogeneous Nonautonomous ODEs System 


Before moving on to study the analytical solu- 
tion for the general n-pool model (13.1), we first 
prepare ourselves with the solution to an initial 
value problem of a homogeneous nonautonomous 
ODEs system in the following form: 


$$X' = A(t)\,X, \qquad X(t_0) = e_i, \tag{13.8}$$

where: 

$$e_i = (\underbrace{0, \cdots, 0}_{i-1 \text{ copies of } 0},\; 1,\; \underbrace{0, \cdots, 0}_{n-i \text{ copies of } 0})^{T}$$

is an n-vector with only the ith entry equal to 1, and all the other entries equal to 0. The term “homogeneous” means that every term in equation (13.8) includes X, and the term “nonautonomous” means that X' depends on the independent variable t explicitly, i.e., the right-hand side of equation (13.8) contains the term A(t), which is a function of t. Note that system (13.8) is one system with n different initial conditions. If we solve (13.8) for $i = 1, 2, \cdots, n$, we will get solutions $X^{(1)}, X^{(2)}, \cdots, X^{(n)}$ corresponding to the n different initial conditions $e_i$ ($i = 1, 2, \cdots, n$), and this set of solutions forms a fundamental matrix: 

$$\Phi(t) = \left[X^{(1)}(t),\; X^{(2)}(t),\; \cdots,\; X^{(n)}(t)\right] \tag{13.9}$$

of the homogeneous nonautonomous ODEs system: 

$$X' = A(t)\,X. \tag{13.10}$$


Non-homogeneous Nonautonomous ODEs System 


Now we consider the following non-homogeneous nonautonomous ODEs system: 

$$X' = A(t)\,X + g(t). \tag{13.11}$$


The difference between equation (13.11) and equation (13.10) is that equation (13.11) includes a non-homogeneous term g(t), i.e., a term which does not include X; that is why equation (13.11) is called a non-homogeneous nonautonomous ODEs system. To look for the solution of equation (13.11), we will need to employ the fundamental matrix $\Phi(t)$ given in equation (13.9) of the corresponding homogeneous system (13.10). Assume that the solution of equation (13.11) has the following form: 

$$X = \Phi(t)\,Q(t), \tag{13.12}$$

where $Q(t)$ is an n-vector, i.e., the solution is a linear combination of the basis of the solution space of the corresponding homogeneous system (13.10). Plugging the solution formula (13.12) into equation (13.11), we get: 

$$\Phi'(t)\,Q + \Phi(t)\,Q' = A\,\Phi\,Q + g. \tag{13.13}$$
Since $\Phi(t)$ is a fundamental matrix of equation (13.10), every column of $\Phi(t)$ is a solution of equation (13.10); therefore the fundamental matrix $\Phi(t)$ also satisfies equation (13.10), i.e.: 

$$\Phi'(t) = A\,\Phi(t). \tag{13.14}$$

Equation (13.14) allows us to cancel the $\Phi'(t)\,Q$ and $A\,\Phi(t)\,Q$ terms on the left- and right-hand sides of equation (13.13), so equation (13.13) becomes: 

$$\Phi(t)\,Q' = g.$$

Equivalently: 

$$Q' = \Phi^{-1}(t)\,g(t),$$

because the matrix $\Phi$, as a fundamental matrix, is clearly invertible. To solve for Q, we integrate this equation, and get: 

$$Q = \int_{t_0}^{t} \Phi^{-1}(s)\,g(s)\,ds + C.$$

Therefore, the solution to the non-homogeneous nonautonomous ODEs system (13.11) is: 

$$X(t) = \Phi(t)\,Q = \Phi(t)\,C + \Phi(t)\int_{t_0}^{t} \Phi^{-1}(s)\,g(s)\,ds, \tag{13.15}$$

where: 

$$C = \Phi^{-1}(t_0)\,X(t_0).$$


N-Pool Model 


The n-pool model: 

$$X' = \xi(t)\,A\,C\,X + B\,u(t), \qquad X(0) = X_0 \tag{13.16}$$

is in principle a non-homogeneous nonautonomous ODEs system (13.11) with: 

$$A(t) = \xi(t)\,A\,C, \qquad g(t) = B\,u(t).$$


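Therefore, we can apply the solution formula (13.15) to solve (13.16): 

$$X(t) = \Phi(t)\,C + \Phi(t)\int_{0}^{t} \Phi^{-1}(s)\,g(s)\,ds = \underbrace{e^{AC\int_{0}^{t}\xi(\sigma)\,d\sigma}\,X_0}_{(a)} + \underbrace{\int_{0}^{t} e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,u(s)\,ds\;B}_{(d)}. \tag{13.17}$$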


Notice that, in this case, the fundamental matrix 
of the corresponding homogeneous system: 


$$X' = \xi(t)\,A\,C\,X$$

is: 

$$\Phi(t) = e^{AC\int_{0}^{t}\xi(\sigma)\,d\sigma}. \tag{13.18}$$

The calculation of the matrix exponential $e^{AC\int_{0}^{t}\xi(\sigma)\,d\sigma}$ is rather complicated. In principle, a matrix exponential $e^{A}$ is defined in the same way as the scalar exponential: 

$$e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}$$

to be: 

$$e^{A} = \sum_{n=0}^{\infty} \frac{A^{n}}{n!}.$$
If the matrix: 

$$A = \mathrm{diag}(a_i)_{i=1}^{n}$$

is a diagonal matrix with $(a_i)_{i=1}^{n}$ as the diagonal entries, then: 

$$e^{A} = \mathrm{diag}\left(e^{a_i}\right)_{i=1}^{n}.$$
If the matrix A is diagonalizable by an invertible matrix P (for example, the eigenvector matrix of A), i.e., 

$$A = P\,\Lambda\,P^{-1},$$

where $\Lambda$ is a diagonal matrix, then the matrix exponential becomes 

$$e^{A} = P\,e^{\Lambda}\,P^{-1}.$$
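
A short numerical check of the diagonalization formula may help. The sketch below assumes NumPy is available and uses a hypothetical 3-pool transfer matrix `M`; it compares the exponential obtained from the eigendecomposition with a truncated version of the defining power series. The function names are illustrative, not part of any library.

```python
import numpy as np


def expm_by_diagonalization(M):
    """exp(M) = P exp(Lambda) P^{-1}; valid only if M is diagonalizable."""
    eigenvalues, P = np.linalg.eig(M)
    return (P @ np.diag(np.exp(eigenvalues)) @ np.linalg.inv(P)).real


def expm_by_series(M, terms=30):
    """Truncated power series sum_{n < terms} M^n / n!."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ M / n
        result = result + term
    return result


# Hypothetical lower-triangular transfer matrix with distinct diagonal entries.
M = np.array([[-0.1, 0.0, 0.0],
              [0.1, -0.3, 0.0],
              [0.0, 0.3, -0.2]])

E_diag = expm_by_diagonalization(M)
E_series = expm_by_series(M)
print(np.allclose(E_diag, E_series))
```

For matrices that are not diagonalizable, or whose eigenvector matrix is badly conditioned, this route is unreliable and a dedicated routine is the safer choice.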


From this formulation, we see that the essential 
step to calculate a matrix exponential is to diag- 
onalize that matrix. Now, we apply this essential 
step to derive the details in the solution formulae 
(13.17). In order to calculate terms (a) and (d) in 
(13.17), we first introduce an auxiliary function: 


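$$\varphi(t) = e^{ACt}. \tag{13.19}$$

To calculate this matrix exponential, we first find the eigenvalues of the matrix product AC: 

$$(\lambda_i)_{i=1}^{n}$$

and a complete set of linearly independent eigenvectors of AC: 

$$v_i, \quad i = 1, \cdots, n,$$

and define the eigenvector matrix: 

$$V = (v_1, \cdots, v_n)$$

so that the matrix product AC can be diagonalized by the eigenvector matrix V, i.e.: 

$$AC = V\,\mathrm{diag}(\lambda_i)_{i=1}^{n}\,V^{-1},$$

which also gives that: 

$$ACt = V\,\mathrm{diag}(\lambda_i t)_{i=1}^{n}\,V^{-1},$$

then $\varphi(t)$ given in (13.19) can be calculated by: 

$$\varphi(t) = e^{ACt} = V\,\mathrm{diag}\left(e^{\lambda_i t}\right)_{i=1}^{n}\,V^{-1}. \tag{13.20}$$

Since the fundamental matrix $\Phi(t)$ given in (13.18) can be written as: 

$$\Phi(t) = \varphi(\tau(t)), \quad \text{where } \tau(t) = \int_{0}^{t}\xi(\sigma)\,d\sigma,$$

this matrix exponential is therefore given as: 

$$\Phi(t) = e^{AC\int_{0}^{t}\xi(\sigma)\,d\sigma} = \varphi(\tau(t)) = V\,\mathrm{diag}\left(e^{\lambda_i \tau(t)}\right)_{i=1}^{n}\,V^{-1} = V\,\mathrm{diag}\left(e^{\lambda_i \int_{0}^{t}\xi(\sigma)\,d\sigma}\right)_{i=1}^{n}\,V^{-1},$$

hence, terms (a) and (d) in (13.17) are calculated as follows: 

$$(a) = e^{AC\int_{0}^{t}\xi(\sigma)\,d\sigma}\,X_0 = V\,\mathrm{diag}\left(e^{\lambda_i \int_{0}^{t}\xi(\sigma)\,d\sigma}\right)_{i=1}^{n}\,V^{-1} X_0 \tag{13.21}$$

and: 

$$\begin{aligned}
(d) &= \int_{0}^{t} e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,u(s)\,ds\;B \\
&= \int_{0}^{t} V\,\mathrm{diag}\left(e^{\lambda_i \int_{s}^{t}\xi(\sigma)\,d\sigma}\right)_{i=1}^{n}\,V^{-1}\,u(s)\,ds\;B \\
&= V \int_{0}^{t} \mathrm{diag}\left(e^{\lambda_i \int_{s}^{t}\xi(\sigma)\,d\sigma}\right)_{i=1}^{n}\,u(s)\,ds\;V^{-1} B \\
&= V\,\mathrm{diag}\left(\int_{0}^{t} e^{\lambda_i \int_{s}^{t}\xi(\sigma)\,d\sigma}\,u(s)\,ds\right)_{i=1}^{n}\,V^{-1} B.
\end{aligned} \tag{13.22}$$

This finishes the calculation of the solution (13.17) to the n-pool model (13.16). However, we want readers to note that the calculation of the eigenvalues $\lambda_i$ and eigenvectors $v_i$, which ultimately form the eigenvector matrix V of the carbon transfer coefficients matrix product AC, is far from trivial in most practical cases. 


Mathematica Calculation For the Analytical Solution of a 3-pool Model 


We consider a 3-pool model with: 
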

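$$A = \begin{pmatrix} -1 & 0 & 0 \\ 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}, \quad C = \begin{pmatrix} c_1 & 0 & 0 \\ 0 & c_2 & 0 \\ 0 & 0 & c_3 \end{pmatrix}, \quad B = \begin{pmatrix} b \\ 0 \\ 0 \end{pmatrix}, \quad \xi(t) = k_1 e^{b_1 t}, \quad u(t) = k_2 e^{b_2 t}, \tag{13.23}$$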

YING WANG 107 


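and we use the mathematical software Mathematica to carry out the following calculations. The matrix product AC is: 

$$AC = \begin{pmatrix} -c_1 & 0 & 0 \\ c_1 & -c_2 & 0 \\ 0 & c_2 & -c_3 \end{pmatrix}.$$

The eigenvalues of AC are: 

$$\lambda_1 = -c_1, \quad \lambda_2 = -c_2, \quad \lambda_3 = -c_3,$$
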

and the eigenvector matrix of AC is: 

$$V = \begin{pmatrix} \dfrac{(c_1 - c_2)(c_1 - c_3)}{c_1 c_2} & 0 & 0 \\ \dfrac{-c_1 + c_3}{c_2} & \dfrac{-c_2 + c_3}{c_2} & 0 \\ 1 & 1 & 1 \end{pmatrix}.$$

Let the initial value $X_0$ be: 

$$X_0 = (x_1, x_2, x_3)^{T},$$

and write, for brevity, $\tau(t) = \int_{0}^{t}\xi(\sigma)\,d\sigma = \dfrac{(-1 + e^{b_1 t})\,k_1}{b_1}$. Then term (a) given in (13.21) has three components: 

$$a_1 = e^{-c_1 \tau(t)}\,x_1,$$

$$a_2 = \frac{e^{-(c_1 + c_2)\tau(t)}\left(-c_1 e^{c_2 \tau(t)}\,x_1 + e^{c_1 \tau(t)}\left(-c_2 x_2 + c_1 (x_1 + x_2)\right)\right)}{c_1 - c_2},$$

and $a_3$ has a much longer expression, and is omitted here. To facilitate the calculation of term (d) given in (13.22), we assume that $b_1 = b_2$; then term (d) given in (13.22) has three components: 

$$d_1 = \frac{b\left(1 - e^{-c_1 \tau(t)}\right)k_2}{c_1 k_1},$$

$$d_2 = \frac{b\left(c_1 - c_1 e^{-c_2 \tau(t)} + c_2\left(-1 + e^{-c_1 \tau(t)}\right)\right)k_2}{c_2 (c_1 - c_2)\,k_1},$$

$$d_3 = \frac{b\,c_1 c_2\,k_2}{k_1}\left[\frac{1 - e^{-c_1 \tau(t)}}{c_1 (c_2 - c_1)(c_3 - c_1)} + \frac{1 - e^{-c_2 \tau(t)}}{c_2 (c_1 - c_2)(c_3 - c_2)} + \frac{1 - e^{-c_3 \tau(t)}}{c_3 (c_1 - c_3)(c_2 - c_3)}\right].$$

Even for a seemingly simple 3-pool model, the computation of the analytical solution of this model is already very complicated. For models with more pools, the calculation could be much more challenging for many cases. 


STABILITY 


Instantaneous Steady State 


The instantaneous steady-state of (13.1): 

$$X' = \xi(t)\,A\,C\,X + B\,u(t) \tag{13.24}$$

is defined as: 

$$X_{ss}(t) = -\xi^{-1}(t)\,u(t)\,(AC)^{-1} B. \tag{13.25}$$

The instantaneous steady state provides the X(t) with zero rate of change at time t. 


Instantaneous Steady State for a 3-pool Model 


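Let us re-visit the 3-pool model (13.23) given in the previous section. A simple calculation shows that the inverse of the carbon transfer coefficients matrix AC is: 

$$\mathrm{Inverse}(AC) = \begin{pmatrix} -1/c_1 & 0 & 0 \\ -1/c_2 & -1/c_2 & 0 \\ -1/c_3 & -1/c_3 & -1/c_3 \end{pmatrix}$$

and therefore, with $b_1 = b_2$, the instantaneous steady state solution is: 

$$X_{ss}(t) = -\xi^{-1}(t)\,u(t)\,\mathrm{Inverse}(AC)\,B = \left(\frac{b k_2}{c_1 k_1},\; \frac{b k_2}{c_2 k_1},\; \frac{b k_2}{c_3 k_1}\right)^{T}.$$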


The Mathematica computational results show that all the eigenvalues $(\lambda_i)_{i=1}^{3}$ of the matrix AC are strictly negative, therefore the instantaneous steady state solution (13.25) is asymptotically stable. We numerically investigate an example to verify that the solution X(t) does converge to the instantaneous steady-state solution $X_{ss}(t)$ as $t \to \infty$. To measure the convergence, we need to study the following function: 

$$\mathrm{distance}(t) = \lVert X(t) - X_{ss}(t) \rVert = \lVert (a) + (d) - X_{ss}(t) \rVert, \tag{13.26}$$

where (a), (d) and $X_{ss}(t)$ are given in (13.21), (13.22) and (13.25). 

In this numerical investigation, we choose $c_1 = 1/10$, $c_2 = 3/10$, $c_3 = 2/10$, $b = 1$, $k_1 = 1$, $k_2 = 1$, $b_1 = 1$, and the initial value $X_0 = (5/10, 2/10, 3/10)$ in the 3-pool model (13.23). Figure 13.1 shows the three components of the solution X(t), the instantaneous steady-state solution $X_{ss}(t)$ and the distance given in (13.26). 
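
The same experiment can be reproduced without Mathematica. The following sketch is an independent numerical check, not the calculation used for Figure 13.1: it integrates the 3-pool model with a hand-written fourth-order Runge-Kutta scheme, using the parameter values quoted above and assuming $b_2 = b_1$, and tracks the distance to the instantaneous steady state. All function names are illustrative.

```python
import math

C1, C2, C3 = 0.1, 0.3, 0.2            # c1, c2, c3
B, K1, K2, B1 = 1.0, 1.0, 1.0, 1.0    # b, k1, k2, b1 (assuming b2 = b1)
X0 = [0.5, 0.2, 0.3]


def xi(t):
    return K1 * math.exp(B1 * t)


def u(t):
    return K2 * math.exp(B1 * t)


def rhs(t, x):
    """Right-hand side xi(t) * AC * X + B * u(t) for the chain matrix AC."""
    scale = xi(t)
    return [scale * (-C1 * x[0]) + B * u(t),
            scale * (C1 * x[0] - C2 * x[1]),
            scale * (C2 * x[1] - C3 * x[2])]


def steady_state(t):
    """Instantaneous steady state: zero rate of change at time t."""
    ratio = B * u(t) / xi(t)
    return [ratio / C1, ratio / C2, ratio / C3]


def distance(t, x):
    return math.sqrt(sum((a - s) ** 2 for a, s in zip(x, steady_state(t))))


def rk4_step(t, x, h):
    k1 = rhs(t, x)
    k2 = rhs(t + h / 2, [a + h / 2 * k for a, k in zip(x, k1)])
    k3 = rhs(t + h / 2, [a + h / 2 * k for a, k in zip(x, k2)])
    k4 = rhs(t + h, [a + h * k for a, k in zip(x, k3)])
    return [a + h / 6 * (p + 2 * q + 2 * r + s)
            for a, p, q, r, s in zip(x, k1, k2, k3, k4)]


steps = 5000
h = 5.0 / steps
t, x = 0.0, list(X0)
distances = [distance(t, x)]
for _ in range(steps):
    x = rk4_step(t, x, h)
    t += h
    distances.append(distance(t, x))

print(f"distance at t = 0: {distances[0]:.3f}")
print(f"distance at t = 5: {distances[-1]:.2e}")
```

Changing `C1`, `C2`, `C3` or `B1` in this sketch is one way to explore how the parameters affect the rate of convergence.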


An interesting question to ask is how the 
parameters affect the rate of convergence to the 
instantaneous steady state solution. This question 
will be left to the readers to explore. 


Global Attractor 


As the name suggests, a global attractor attracts all the solutions regardless of the initial profiles. We will introduce the definition of the global attractor, how to numerically estimate the global attractor, and a mathematical study of the convergence of the solutions to the global attractor as $t \to \infty$. At the end of this section, we will also give the condition under which the global attractor and the instantaneous steady-state solution given in (13.25) converge to each other. 

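Consider a non-homogeneous nonautonomous system in the general form: 

$$X'(t) = A(t)\,X(t) + b(t), \tag{13.27}$$

where A(t) has the block matrix form: 

$$A(t) = \begin{pmatrix} A_{11}(t) & 0 \\ A_{21}(t) & A_{22}(t) \end{pmatrix}, \quad A_{11}(t) \in \mathbb{R}^{d_1 \times d_1},\; A_{21}(t) \in \mathbb{R}^{d_2 \times d_1},\; A_{22}(t) \in \mathbb{R}^{d_2 \times d_2}, \text{ and } d_1 + d_2 = d.$$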


Assume that: 


1) A(t) is a lower triangular matrix 
2) $\lVert b(t) \rVert < B$ for some $B > 0$ (i.e., b(t) is bounded) 
3) $\exists\,\delta < 0$, such that 
   • $a_{ii}(t) < \delta$, for $i = 1, \cdots, d_1$ 
   • $a_{ii}(t) < 0$, for $i = d_1 + 1, \cdots, d$ (i.e., A(t) has negative diagonal entries) 
   • $|a_{ii}(t)| + \delta > \sum_{j \neq i,\; j > d_1} a_{ij}(t)$, for $i = d_1 + 1, \cdots, d$ (i.e., $A_{22}(t)$ is diagonally dominant) 


Then: 

$$\lVert \Phi(t, s) \rVert < K e^{\delta (t - s)} \tag{13.28}$$

for some $K > 0$ and $t > s$, where $\Phi(t, s)$ is the fundamental matrix of: 

$$X'(t) = A(t)\,X(t),$$

Figure 13.1. The three panels show (a) the solution X(t), (b) the equilibrium solution $X_{ss}(t)$ and (c) the distance versus time, for 0 < t < 5. The blue, red, green curves represent the first, second and third components respectively. 


i.e., the solution of the general nonautonomous equation (13.27) can be written as: 

$$X(t) = \Phi(t, s)\,X(s) + \int_{s}^{t} \Phi(t, \tau)\,b(\tau)\,d\tau.$$

Let: 

$$\mu(t) = \int_{-\infty}^{t} \Phi(t, s)\,b(s)\,ds, \tag{13.29}$$

then $\mu(t)$ is the unique global attractor of the nonautonomous system (13.27). 


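Notice that based on (13.28), (13.29) and the assumption on boundedness of b(t), for a truncation window $T > 0$: 

$$\begin{aligned}
\lVert \mu(t) \rVert &\le \int_{t-T}^{t} \lVert \Phi(t, s)\,b(s) \rVert\,ds + \int_{-\infty}^{t-T} \lVert \Phi(t, s)\,b(s) \rVert\,ds \\
&\le \int_{0}^{T} \lVert \Phi(t, t-\sigma)\,b(t-\sigma) \rVert\,d\sigma + \int_{-\infty}^{t-T} K e^{\delta (t-s)} \lVert b(s) \rVert\,ds \\
&\le \int_{0}^{T} \lVert \Phi(t, t-\sigma)\,b(t-\sigma) \rVert\,d\sigma + \frac{K B e^{\delta T}}{|\delta|} \\
&\approx \int_{0}^{T} \lVert \Phi(t, t-\sigma)\,b(t-\sigma) \rVert\,d\sigma,
\end{aligned} \tag{13.30}$$

where the last step holds for sufficiently large T, because $\delta < 0$ makes the remainder term exponentially small. Equation (13.30) gives a way to numerically estimate the global attractor $\mu(t)$. 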
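
The truncation idea can be tried on a scalar toy problem. The sketch below assumes the hypothetical system $x' = -x + \sin t$, for which $A(t) = \delta = -1$, $K = 1$, $|b(t)| \le 1$ and the attractor is known in closed form, $\mu(t) = (\sin t - \cos t)/2$. It approximates $\mu(t)$ by integrating over a finite window of length `T` and compares the error with the remainder bound $K B e^{\delta T}/|\delta|$.

```python
import math

DELTA = -1.0    # scalar A(t) = DELTA, so Phi(t, s) = exp(DELTA * (t - s)) and K = 1
B_BOUND = 1.0   # |b(t)| = |sin t| <= 1


def b(t):
    return math.sin(t)


def phi(t, s):
    return math.exp(DELTA * (t - s))


def attractor_estimate(t, T, n=4000):
    """Simpson approximation of the integral of Phi(t, t - sigma) * b(t - sigma) over [0, T]."""
    h = T / n

    def integrand(sigma):
        return phi(t, t - sigma) * b(t - sigma)

    total = integrand(0.0) + integrand(T)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3


def attractor_exact(t):
    """Closed-form attractor of x' = -x + sin t."""
    return (math.sin(t) - math.cos(t)) / 2


def truncation_bound(T):
    """Remainder bound K * B * exp(delta * T) / |delta| with K = 1."""
    return B_BOUND * math.exp(DELTA * T) / abs(DELTA)


for T in (2.0, 5.0, 20.0):
    error = abs(attractor_estimate(3.0, T) - attractor_exact(3.0))
    print(f"T = {T:4.1f}: error = {error:.2e}, bound = {truncation_bound(T):.2e}")
```

The error of the truncated integral stays below the bound and shrinks exponentially as the window grows.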

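All the solutions (regardless of the initial conditions) will converge to $\mu(t)$ as $t \to \infty$. This is because: 

$$X(t) - \mu(t) = \Phi(t, s)\left(X(s) - \mu(s)\right).$$

Therefore: 

$$\lVert X(t) - \mu(t) \rVert < K e^{\delta (t - s)}\,\lVert X(s) - \mu(s) \rVert \xrightarrow{\;t \to \infty\;} 0, \quad \text{since } \delta < 0.$$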


The Global Attractor of the N-Pool Model 


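Consider: 

$$X' = \xi(t)\,A\,C\,X + B\,u(t), \tag{13.31}$$

its global attractor is: 

$$\begin{aligned}
\mu(t) &= \int_{-\infty}^{t} \Phi(t, s)\,B\,u(s)\,ds \\
&= -\int_{-\infty}^{t} e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,\xi(s)\,A\,C\,X_{ss}(s)\,ds \\
&= \left[e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,X_{ss}(s)\right]_{s=-\infty}^{s=t} - \int_{-\infty}^{t} e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,X_{ss}'(s)\,ds \\
&= X_{ss}(t) - \int_{-\infty}^{t} e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,X_{ss}'(s)\,ds,
\end{aligned}$$

where the second line uses $B\,u(s) = -\xi(s)\,A\,C\,X_{ss}(s)$ from (13.25), and the third line is an integration by parts in which the boundary term at $s \to -\infty$ is assumed to vanish. Therefore: 

$$\mu(t) - X_{ss}(t) = -\int_{-\infty}^{t} e^{AC\int_{s}^{t}\xi(\sigma)\,d\sigma}\,X_{ss}'(s)\,ds = -\int_{-\infty}^{t} \Phi(t, s)\,X_{ss}'(s)\,ds.$$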


General Stability Statements 


CASE 1 


Consider a homogeneous system of ODEs with constant coefficient matrix: 

$$X' = AX.$$

Let $\lambda_j$, $j = 1, \cdots, n$, counting multiplicity, be the eigenvalues of matrix A, and let X be the equilibrium state. 

• If $\mathrm{Re}\,\lambda_j < 0$ for all $j = 1, \cdots, n$, then X is asymptotically stable. 

• If $\mathrm{Re}\,\lambda_j \le 0$ for all $j = 1, \cdots, n$, and the eigenvalues with real part equal to zero are simple, then X is stable. 

• If there exists $\lambda_j$ such that $\mathrm{Re}\,\lambda_j = 0$, but it is not simple, then X is unstable. 

• If there exists $\lambda_j$ such that $\mathrm{Re}\,\lambda_j > 0$, then X is unstable. 


CASE 2 


Consider a homogeneous system of nonautonomous ODEs: 

$$X' = AX + B(t)\,X,$$

where B(t) is a continuous function. Let $\lambda_j$, $j = 1, \cdots, n$, counting multiplicity, be the eigenvalues of matrix A, and let X be the equilibrium state. 

• If $\mathrm{Re}\,\lambda_j < 0$ for all $j = 1, \cdots, n$, and $\lim_{t \to \infty} \lVert B(t) \rVert = 0$, then X is asymptotically stable. 

• If $\mathrm{Re}\,\lambda_j \le 0$ for all $j = 1, \cdots, n$, and the eigenvalues with real part equal to zero are simple, and $\int_{t_0}^{\infty} \lVert B(t) \rVert\,dt < \infty$, then X is stable. 

• If there exists $\lambda_j$ such that $\mathrm{Re}\,\lambda_j > 0$, and $\lim_{t \to \infty} \lVert B(t) \rVert = 0$, then X is unstable. 



CASE 3 


Consider the following general system: 

$$X' = AX + B(t)\,X + f(t, X),$$

where B(t) is a continuous function. Let $X_0$ be an equilibrium solution. If: 

1) $\lim_{t \to \infty} \lVert B(t) \rVert = 0$; 
2) f(t, X) is continuous in t, and has continuous partial derivatives in X; 
3) $\lim_{\lVert X \rVert \to 0} \dfrac{\lVert f(t, X) \rVert}{\lVert X \rVert} = 0$ uniformly in t, 

then: 

a. If $\mathrm{Re}\,\lambda_j < 0$, where $\lambda_j$ are the eigenvalues of A, then $X = X_0$ is asymptotically stable. 

b. If there exists an eigenvalue $\lambda_j$ such that $\mathrm{Re}\,\lambda_j > 0$, then $X = X_0$ is unstable. 
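
The eigenvalue test of Case 1 is easy to automate. The sketch below assumes NumPy is available and applies the four Case 1 rules literally, including the simplification that a repeated eigenvalue on the imaginary axis is reported as unstable. The function name `classify_equilibrium` and the tolerances are illustrative choices, and numerically computed repeated eigenvalues can be slightly perturbed, so borderline cases deserve a closer look.

```python
import numpy as np


def classify_equilibrium(A, tol=1e-8):
    """Classify the equilibrium of X' = A X using the Case 1 eigenvalue rules."""
    eigenvalues = np.linalg.eigvals(np.asarray(A, dtype=float))
    real_parts = eigenvalues.real

    if np.all(real_parts < -tol):
        return "asymptotically stable"
    if np.any(real_parts > tol):
        return "unstable"

    # No eigenvalue has a positive real part, but some lie on the imaginary axis.
    critical = eigenvalues[np.abs(real_parts) <= tol]
    for value in critical:
        multiplicity = np.sum(np.abs(critical - value) <= 1e-6)
        if multiplicity > 1:
            return "unstable"   # zero real part but not simple
    return "stable"


examples = {
    "decay": [[-1.0, 0.0], [0.0, -2.0]],
    "rotation": [[0.0, 1.0], [-1.0, 0.0]],
    "double zero": [[0.0, 1.0], [0.0, 0.0]],
    "saddle": [[1.0, 0.0], [0.0, -1.0]],
}
for name, matrix in examples.items():
    print(f"{name}: {classify_equilibrium(matrix)}")
```

For the carbon cycle matrix AC, which is lower triangular with negative diagonal entries, the function returns "asymptotically stable".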


SUGGESTED READING 


QUIZZES 


Find the general solution of the following ordinary 
differential equations and system of ODEs. 


$$y' + y = 2, \quad y(0) = 0$$

$$x y' + 2y = 3x, \quad y(1) = 5$$


2xy' — 3y = 9x" 


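$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & -5 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} 2\sin t \\ -3\cos t \end{pmatrix}$$

Find the critical point(s) of the following systems, and determine whether each is asymptotically stable, stable, or unstable. 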


HG a) 
This chapter introduces the concept of spin-up in 
land biogeochemical models. Spin-up is a stan- 
dard initialization procedure, which is commonly 
proposed in protocols of model intercomparison 
projects. Conventional spin-up, repeatedly running 
the model with recursive environmental forcing 
to achieve a steady state, is very time consum- 
ing. Semi-analytic spin-up (SASU) enabled by the 
matrix representation significantly improves the 
computational efficiency of spin-up for land bio- 
geochemical models. The implementation of SASU 
in the Community Atmosphere and Biosphere 
Land Exchange (CABLE) model is demonstrated. 
The SASU method has been shown to save about 
92.4% and 86.6% of the computational time for 
spin-up of the global carbon-only and coupled 
carbon-nitrogen models, respectively. 


WHAT IS SPIN-UP? 


Spin-up is a standard initialization procedure used by land surface models, atmosphere models, ocean models, or earth system models. Generally speaking, spin-up provides initial values of state variables, which are critical to the solution of the model. For instance, you might have heard of the so-called butterfly effect in weather/climate forecasting: slightly different initial values might cause different or even completely opposite results. 

The initialization procedure is an essential step 
in simulation of land biogeochemical models. 
Model intercomparison projects are often carried 
out in order to elaborate understanding of car- 
bon cycle dynamics and characterize intermodel 
uncertainty. These exercises require a standardized 
protocol in order to ensure that results from dif- 
ferent participating models can be compared on an 
equal basis. Model intercomparison projects have 
for instance been undertaken to assess the anthro- 
pogenic impact on land biogeochemical cycles. 
The industrial revolution is commonly believed to 
have triggered much of the current anthropogenic 
impact such as increasing fossil fuel emissions, land 
use change, and increasing nitrogen deposition. 
It represents the starting point of a period when 
humans have played a more important role than 
ever before in impacting the earth's geology and 
ecosystems. Land biogeochemical cycles are expe- 
riencing an important transition period unique 
in the history of the earth. So, in the protocol of 
model intercomparison projects, the simulation 
of the transition period in biogeochemical cycles 
from industrial revolution to present-day is often 
called the historical simulation. One of the primary 
goals for the historical simulation is to hindcast 
the ecosystem changes in response to anthropo- 
genic activity since the onset of industrialization 
around the 19th century. Initial values matter to 
the growing trend of ecosystem carbon storage. 
However, determining these initial values can be 
a tricky issue when we have little information on 
the initial state. Direct observations for data such 
as soil temperature, moisture, carbon, and nitro- 
gen storage are not generally available until very 
recent times. Thus, in these protocols, steady state is 
commonly used to determine initial values for the 
historical simulation. 

Steady state is achievable by most land biogeo- 
chemical models as long as we run the model long 
enough. Chapter 1 provided the theoretical foun- 
dation; that is, for given forcing, ecosystem car- 
bon storage tends to grow towards the same steady 
state regardless of the initial carbon storage once a 
model structure is defined and parameter values 
are given. The unique steady state does not rely on 
the initial value as the land carbon cycle converges 
to a steady state that is determined by carbon input 
and residence time. Thus, the procedure of gener- 
ating steady state can be standardized across dif- 
ferent models. 

Conventionally, the steady state of carbon 
storage can be achieved by repeatedly running 
the model with recursive environmental forcing. 
Most model intercomparison projects include a 
protocol to achieve the steady state. For example, the Multiscale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP) protocol requires that the spin-up use a random climate driver data package, constant 1801 land cover, constant pre-industrial atmospheric CO₂ concentration, and constant nitrogen deposition to force the model. Under a constant atmospheric CO₂ concentration, constant nitrogen inputs and random climate forcing, ecosystem carbon storage will ultimately reach a quasi-steady state. As a rule, to detect whether steady state has been achieved, MsTMIP specifies that the 100-yr mean interannual change in total ecosystem carbon stocks for consecutive years must be below 1 g C m⁻² yr⁻¹ for 95% of grid cells. The carbon stocks accumulated at steady state are used as the initial values 
in the transient historical simulation from 1801 
to 2010. This approach, which takes advantage 
of a model's native dynamics to drive the system 
towards the steady state, is called native dynam- 
ics spin-up (ND), and is the most widely used by 
current models. 
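
A native dynamics spin-up loop can be sketched in a few lines. The example below is only an illustration: the three-pool chain, its turnover rates `K`, transfer fractions `TRANSFER`, the repeating environmental scalars in `FORCING` and the input `U` are hypothetical, and the stopping rule is one possible reading of the MsTMIP-style criterion (net change in total carbon over a 100-year window, divided by 100, below 1 g C m⁻² yr⁻¹) for a single grid cell.

```python
# Hypothetical 3-pool model; every number here is an illustrative assumption.
K = [0.5, 0.2, 0.01]                  # turnover rates (1/yr)
TRANSFER = [0.6, 0.3]                 # fraction of pool i outflow entering pool i+1
FORCING = [0.9, 1.1, 1.0, 0.8, 1.2]   # repeating annual environmental scalars
U = 1000.0                            # carbon input (g C m-2 yr-1)


def step_year(x, scalar, substeps=100):
    """Advance the pools by one year with forward-Euler sub-steps."""
    dt = 1.0 / substeps
    for _ in range(substeps):
        out = [scalar * K[i] * x[i] for i in range(3)]
        x = [x[0] + dt * (U - out[0]),
             x[1] + dt * (TRANSFER[0] * out[0] - out[1]),
             x[2] + dt * (TRANSFER[1] * out[1] - out[2])]
    return x


def native_dynamics_spinup(x, threshold=1.0, window=100, max_years=100000):
    """Recycle the forcing until the windowed mean change in total C is below threshold."""
    totals = [sum(x)]
    for year in range(max_years):
        x = step_year(x, FORCING[year % len(FORCING)])
        totals.append(sum(x))
        if len(totals) > window:
            mean_change = abs(totals[-1] - totals[-1 - window]) / window
            if mean_change < threshold:
                return x, year + 1
    raise RuntimeError("steady state not reached within max_years")


pools, years = native_dynamics_spinup([0.0, 0.0, 0.0])
print(f"steady state reached after {years} simulated years; total C = {sum(pools):.0f}")
```

Even in this toy setting the slowest pool dictates the spin-up length, which is why several hundred simulated years are needed.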


DEVELOPMENT OF SPIN-UP APPROACHES 


Several other spin-up approaches have also been developed. The motivation for developing alternatives to ND in land biogeochemical models is to improve the computational efficiency of the spin-up. For most models, the ND approach usually takes hundreds to thousands of simulation years to reach the steady state, which is computationally very expensive. Efficiency is declining further as ongoing development makes models more and more complex. For example, vertically-resolved soil carbon modules are developed to represent carbon vertical mixing and the soil freeze-thaw impact on soil decomposition in permafrost regions. While accounting for these additional processes better represents the nature of the heterogeneity, the extremely slow turnover rates for deep soil carbon in permafrost regions substantially slow down the spin-up. A more efficient spin-up approach is urgently needed. 

Spin-up approaches that modify ND but 
allow the state variables to naturally accumu- 
late to the steady state without any arbitrary 
adjustments to the state variables themselves are 
called ad hoc approaches. A punctuated nitrogen 
addition approach (PN) manipulates the rate of 
new mineral nitrogen addition (Thornton et al., 
2002; Thornton and Rosenbloom, 2005). PN is 
proposed as a solution to reduce spin-up times 
based on the observation that the time taken 
by a coupled carbon-nitrogen biogeochemical 
model to reach a steady state under ND is mainly 
controlled by the new nitrogen addition. PN is 
faster than ND but the higher nitrogen inputs 
also result in larger carbon pools at steady state. 
Therefore, an additional ND needs to be carried 
out after PN, so that PN results conform to ND 
results. The accelerated decomposition approach (AD) was first implemented in the Biome-BGC model (Thornton and Rosenbloom, 2005). AD arbitrarily increases the decomposition rate during spin-up to allow the system to approach a steady state faster. AD is informed by the experience that scaling of litter and soil carbon decomposition produces a new steady state, which is linearly related to the scaling constant (Thornton and Rosenbloom, 2005). AD is much faster than ND, but it also generates lower steady state carbon and nitrogen pools, because interactions with the soil passive pool become much stronger. To achieve the final steady state, additional ND after AD is required, which is called the post-AD process. The post-AD process is sometimes much longer than AD itself. AD has been implemented in CLM version 4 (Koven et al., 2013) and the following versions (CLM4.5 and CLM5) with vertically-resolved soil carbon. 

In contrast to ad hoc approaches, spin-up 
approaches allowing arbitrary change in state 
variables are called generalized optimization 
approaches. The semi-analytic spin-up approach (SASU) updates the state variables with steady state estimates from an analytic solution of the model. The SASU approach was first implemented in the Community Atmosphere-Biosphere-Land Exchange (CABLE) model for carbon and nitrogen cycle spin-up (Xia et al., 2012). For carbon-cycle-only simulations, SASU saves well over 90% of the computational cost for CABLE site-level and global simulations. For carbon-nitrogen coupled simulation, SASU saves over 80% of the computational cost. The SASU approach has the additional advantage that it enables parameter sensitivity analysis of the biogeochemical parameters (Huang et al., 2018b), as well as data assimilation for the estimation of these parameters (Tao et al., 2020), both of which were prohibitively expensive before SASU was proposed. 

Matrix representations have been derived for 
many land biogeochemical models, in part to 
enable them to use the SASU approach for spin- 
up. In the last decade matrix representations have 
been published for TECO (Jiang et al., 2017; Luo 
et al., 2017), CABLE (Xia et al., 2013), LPJ-GUESS 
(Ahlström et al., 2015), ORCHIDEE (Huang et al., 
2018b), CLM3.5 (Hararuk et al., 2015; Hararuk 
et al., 2014; Rafique et al., 2016), CLM4 (Rafique 
et al., 2017), CLM4.5 (Huang et al., 2018a), and 
CLM5 (Lu et al., 2020). 


The accelerated spin-up (ASU) approach is 
implemented in the Terrestrial Ecosystem Model 
(TEM). ASU is similar to SASU and also belongs 
to the generalized optimization approach, but is 
applied to the traditional as opposed to the matrix 
form of the model. ASU numerically solves the 
model's steady state considering the seasonal cycle 
of carbon and nitrogen storage. Qu et al. (2018) 
found that ASU saved 90% and 85% of computa- 
tional costs for TEM site-level and North American 
regional simulations, respectively. 


SEMI-ANALYTIC SPIN-UP 


The SASU approach is built upon the mathematical
foundation of biogeochemical cycles in terrestrial
ecosystems, introduced in Chapter 1. The biogeochemical
cycling of carbon in an ecosystem is
usually initiated with plant photosynthesis, which
transfers CO2 from the atmosphere into an ecosystem.
The carbon assimilated through photosynthesis
is partitioned into compartments of plant
biomass, such as leaf, root, and woody biomass.
Plant biomass lost through phenological turnover,
mortality, or damage by herbivores becomes
litter, entering metabolic, structural, and coarse
woody debris (CWD) litter pools. The litter carbon
is partially released to the atmosphere as CO2
respired by decomposing microbes, and partially
converted to soil organic matter (SOM) in fast,
slow, and passive pools (Figure 14.1).

The mean carbon residence time varies greatly 
among different pools, from several months in 
leaves and roots to hundreds or thousands of 
years in some woody tissues and SOM (Torn et al., 
1997). These carbon processes in terrestrial eco- 
systems can be mathematically expressed by the 
following carbon balance equation in a matrix 
form (Luo et al., 2017; Luo et al., 2003): 


$$\frac{dX(t)}{dt} = B\,U(t) + A\,\xi(t)\,K\,X(t) \qquad (14.1)$$


Taking the CABLE model (Wang et al., 2011)
as one example, $X(t) = (X_1(t), X_2(t), \ldots, X_9(t))^T$ is
a 9 × 1 vector describing pool sizes for the nine
carbon pools leaf, root, wood, metabolic litter,
structural litter, CWD, fast SOM, slow SOM, and
passive SOM, respectively. A and K are then 9 × 9
matrices given by:
Figure 14.1. Diagram of the carbon processes of the CABLE model on which Equations 14.1-14.3 are based. SOM stands for soil organic matter.
[Diagram labels: canopy photosynthesis; leaf ($X_1$), root ($X_2$), and woody ($X_3$) pools; metabolic and structural litter; fast, slow, and passive SOM; CO2.]


$$
A = \begin{pmatrix}
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
a_{41} & a_{42} & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
a_{51} & a_{52} & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & a_{74} & a_{75} & a_{76} & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & a_{85} & a_{86} & a_{87} & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & a_{97} & a_{98} & -1
\end{pmatrix} \qquad (14.2)
$$

$$K = \mathrm{diag}(k) \qquad (14.3)$$


where A denotes the carbon transfer matrix, in
which $a_{ij}$ represents the fraction of carbon transferred
from pool j to pool i. The diag(k) is a 9 × 9 diagonal
matrix with diagonal entries given by the vector
$k = (k_1, k_2, \ldots, k_9)^T$, whose components $k_j$ (j = 1, 2, ..., 9)
quantify the fraction of carbon leaving pool $X_j$
in each time step. $\xi(t)$ is an
environmental scalar accounting for effects of
soil type, temperature, and moisture on carbon
decomposition. $B = (b_1, b_2, b_3, 0, \ldots, 0)^T$ represents
the partitioning coefficients of the photosynthetically
fixed carbon into different pools. U(t) is the
input of carbon via plant photosynthesis. Other
models may have a slightly different set of biomass,
litter, and soil carbon pools, but in general,
Equation 14.1 can adequately summarize carbon cycle
processes in most land models (Luo et al., 2015;
Luo et al., 2017).
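The structure of Equations 14.1-14.3 can be written down directly in code. The sketch below is a minimal illustration, assuming placeholder values for the transfer fractions, turnover rates, and partitioning coefficients; none of the numbers are CABLE parameters, and the names `transfers`, `dXdt`, and `respired_fraction` are introduced here only for illustration.

```python
import numpy as np

POOLS = ["leaf", "root", "wood", "metabolic litter", "structural litter",
         "CWD", "fast SOM", "slow SOM", "passive SOM"]

# Off-diagonal transfer fractions a_ij (from pool j to pool i), keyed by
# 1-based (i, j) to mirror Equation 14.2. Values are placeholders only.
transfers = {
    (4, 1): 0.6, (5, 1): 0.4,                    # leaf -> litter pools
    (4, 2): 0.5, (5, 2): 0.5,                    # root -> litter pools
    (6, 3): 1.0,                                 # wood -> CWD
    (7, 4): 0.45, (7, 5): 0.30, (7, 6): 0.25,    # litter -> fast SOM
    (8, 5): 0.15, (8, 6): 0.20, (8, 7): 0.30,    # -> slow SOM
    (9, 7): 0.004, (9, 8): 0.03,                 # -> passive SOM
}

A = -np.eye(9)                                   # diagonal entries are -1
for (i, j), fraction in transfers.items():
    A[i - 1, j - 1] = fraction

k = np.array([1.0, 0.8, 0.03, 5.0, 0.4, 0.1, 2.0, 0.05, 0.002])  # yr^-1
K = np.diag(k)
B = np.array([0.3, 0.3, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

def dXdt(X, U, xi):
    """Right-hand side of Equation 14.1 for pool vector X, carbon input U,
    and a scalar environmental modifier xi."""
    return B * U + A @ (xi * K) @ X

# Share of each pool's outflow that is not passed on to another pool,
# i.e. respired to the atmosphere: one minus the column sum of transfers.
respired_fraction = -A.sum(axis=0)
```

A quick sanity check on any such matrix is that every entry of `respired_fraction` lies between 0 and 1; otherwise the transfer fractions out of some pool would create or destroy carbon.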

Equation 14.1 cannot be directly solved to
obtain the steady-state values of carbon pools
because the environmental scalar $\xi(t)$, ecosystem
carbon influx U(t), and possibly matrix A and vector
B vary with time and driving variables. Since carbon
influx involves fast processes, its steady-state
value $U_{ss}$ can be obtained from a short ND spin-up,
which we call the initial spin-up. The initial spin-up
uses repeated meteorological forcing to drive
modeled carbon and nitrogen dynamics. Thus, it is
possible to calculate averaged values of the environmental
scalar ($\bar{\xi}$), the carbon transfer ($\bar{A}$), and
partitioning ($\bar{B}$) coefficients within one cycling
period of the meteorological variables. With carbon
input at steady state $U_{ss}$ and the mean values of
the time-varying variables ($\bar{\xi}$, $\bar{A}$, and $\bar{B}$), we can
analytically calculate the steady-state carbon pool
sizes $X_{ss,C}$ by letting the right-hand side of Equation
14.1 equal zero:


$$X_{ss,C} = -(\bar{A}\,\bar{\xi}\,K)^{-1}\,\bar{B}\,U_{ss} \qquad (14.4)$$


We can divide $X_{ss,C}$ by the C/N ratio of each pool
to obtain the steady-state nitrogen pool sizes $X_{ss,N}$.
Equation 14.4 gives the steady state in the
absence of soil nitrogen feedback to NPP. The final
steady state is reached through the full procedure
described in the next section.
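As a minimal sketch of Equation 14.4, the steady state can be obtained with a single linear solve once cycle-mean values are available. The three-pool model, the monthly forcing, and the fixed C/N ratios below are assumptions made for illustration, not values from CABLE.

```python
import numpy as np

# Hypothetical 3-pool model (plant, litter, soil); values are illustrative.
A = np.array([[-1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0],
              [0.0, 0.3, -1.0]])
K = np.diag([0.5, 1.0, 0.01])              # baseline turnover (yr^-1)
B = np.array([1.0, 0.0, 0.0])
cn_ratio = np.array([40.0, 60.0, 12.0])    # assumed fixed C/N per pool

# Monthly environmental scalar and carbon input over one forcing cycle,
# standing in for values saved during the initial spin-up.
months = np.arange(12)
seasonal = np.sin(2 * np.pi * (months - 3) / 12)
xi = 0.6 + 0.4 * seasonal
U = 800.0 * (1 + 0.5 * seasonal)           # gC m^-2 yr^-1

xi_bar, U_ss = xi.mean(), U.mean()

# Equation 14.4, evaluated with a linear solve instead of an explicit inverse.
X_ss_C = -np.linalg.solve(A @ (xi_bar * K), B * U_ss)

# Nitrogen pools follow from the carbon pools and the C/N ratios.
X_ss_N = X_ss_C / cn_ratio

# Residual of Equation 14.1 at the solution; it should vanish.
residual = B * U_ss + A @ (xi_bar * K) @ X_ss_C
```

Using `np.linalg.solve` rather than forming the inverse is a design choice for numerical robustness; the result is the same as the matrix expression in Equation 14.4.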


THE PROCEDURE OF SEMI-ANALYTIC SPIN-UP 
IN CABLE 


The procedure of SASU implementation in a 
land surface model is demonstrated by the case 
of CABLE. In general, the procedure consists of
four preparation steps: (1)
development of a flow diagram; (2) organization
of the matrix equation; (3) identification of the
time-varying elements; and (4) coding the analytic solution;
followed by three modeling steps: (5) initial spin-up; (6)
analytic solution of steady state; and (7) final ND
spin-up. In more detail, these seven steps entail
(Figure 14.2):


1. Developing a flow diagram reflecting the
linkages between ecosystem carbon pools in
the target model structure, as shown in Figure
14.1 (also see lectures and practice in Unit 1).


2. Organizing CABLE into a matrix equation by
encoding the linkage between carbon pools
and fluxes in the carbon transfer matrices A
and K, and plant carbon partitioning coefficients
in vector B. Values for non-zero elements
in vector B and matrices A and K are
assigned based on parameter values in the
original CABLE model.


3. Identifying how the time-varying elements
of A, B, and $\xi$ in Equation 14.1 are
determined in the model.


4. Coding the analytic solution of the biogeochemical
steady state in CABLE, which
includes two steps: (a) creating new variables
to store the mean values of the
time-varying parameters; and (b) creating
equations to calculate the analytical solutions
of each pool according to the structures
of matrices A and K and vector B.


5. Performing the initial spin-up by running
the model using repeated meteorological
forcing until NPP (or all plant pools)
reaches steady state ($U_{ss}$), that is, until
the mean change in NPP over each loop is
smaller than 0.01% per year. Meanwhile, the
values of all the time-varying parameters in
Equation 14.4 are updated by the mean values
from the initial spin-up.


6. Calculating the analytical solution of the
steady states of carbon and nitrogen pools.
The steady-state carbon pools are solved by
setting carbon influx equal to efflux for each
pool (Equation 14.4). Nitrogen pools are
obtained by dividing the steady-state carbon
pools by their C/N ratios at the end of the
initial spin-up.

7. Performing the final spin-up by using the
analytically solved carbon and nitrogen
pools as initial values until the steady-state
criterion for the soil carbon pools is
met. Here our steady-state criterion is
that the change in the passive soil carbon
pool ($\Delta C_{pass}$) within each simulation year
is smaller than 0.5 gC m$^{-2}$ yr$^{-1}$. This criterion
was chosen following a recommendation
of Thornton and Rosenbloom (2005).
Because pools differ in turnover rate,
a slower pool needs a longer time to reach
steady state during the spin-up. When the
steady-state criterion is small enough, the length
of the final spin-up is determined by the dynamics of the
slowest carbon pool.

Figure 14.2. Procedure of the semi-analytical spin-up (SASU). (Reproduced from Xia et al. (2012).)
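The three modeling steps (initial spin-up, analytic solution, final spin-up) can be sketched on a toy model. Everything below is an assumption made for illustration: a hypothetical three-pool carbon-only model, a made-up 10-year forcing cycle, forward-Euler annual steps, and the 0.5 gC m$^{-2}$ yr$^{-1}$ criterion applied to the slowest pool. It is not the CABLE implementation.

```python
import numpy as np

# Hypothetical 3-pool carbon-only model (plant, litter, passive soil).
A = np.array([[-1.0, 0.0, 0.0],
              [0.9, -1.0, 0.0],
              [0.0, 0.2, -1.0]])
K = np.diag([0.3, 0.1, 0.001])        # turnover rates (yr^-1)
B = np.array([1.0, 0.0, 0.0])
CYCLE = 10                            # length of the repeated forcing (yr)

def forcing(year):
    """Carbon input U and environmental scalar xi for one year of the cycle."""
    phase = 2 * np.pi * (year % CYCLE) / CYCLE
    return 600.0 * (1 + 0.2 * np.sin(phase)), 1.0 + 0.3 * np.cos(phase)

def step(X, year):
    """One forward-Euler year of Equation 14.1."""
    U, xi = forcing(year)
    return X + B * U + A @ (xi * K) @ X

def native_dynamics(X, tol=0.5, max_cycles=20_000):
    """Repeat the forcing cycle until the passive pool drifts by less than
    tol (gC m^-2 yr^-1); return the final state and the years simulated."""
    for n in range(1, max_cycles + 1):
        X_old = X
        for year in range(CYCLE):
            X = step(X, year)
        if abs(X[-1] - X_old[-1]) / CYCLE < tol:
            break
    return X, n * CYCLE

# Traditional spin-up: native dynamics from empty pools.
_, nd_years = native_dynamics(np.zeros(3))

# Step 5 (initial spin-up): counted here as five forcing cycles. The toy
# forcing is prescribed, so the cycle means can be computed directly.
init_years = 5 * CYCLE
U_ss = np.mean([forcing(y)[0] for y in range(CYCLE)])
xi_bar = np.mean([forcing(y)[1] for y in range(CYCLE)])

# Step 6: analytic steady state (Equation 14.4).
X_analytic = -np.linalg.solve(A @ (xi_bar * K), B * U_ss)

# Step 7: final native-dynamics spin-up starting from the analytic estimate.
X_final, final_years = native_dynamics(X_analytic)
sasu_years = init_years + final_years
```

Comparing `sasu_years` with `nd_years` shows where the saving comes from: the slow passive pool no longer has to be filled year by year, so the final spin-up only has to remove the small mismatch left by averaging the forcing.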


COMPUTATIONAL EFFICIENCY 


In our example using CABLE, the traditional spin-up
method requires 2780 and 5099 simulation years
for carbon-only and coupled carbon-nitrogen
simulations, respectively, before the change in the
slowest carbon pool meets the steady-state criterion
($\Delta C_{pass}$ < 0.5 gC m$^{-2}$ yr$^{-1}$; Figure 14.3). After
implementing SASU, the initial spin-up takes 200
simulation years to achieve steady states of plant
carbon pools for both the carbon-only model and
the coupled carbon-nitrogen model. This allows
the sizes of all SOM pools to be calculated analytically
after this initial spin-up (step 6 above).
With the SASU method, all carbon pools in the
carbon-only model reach steady states after the analytical
calculation without a need for any final spin-up
(step 7 above; black arrow in Figure 14.3). In
the coupled carbon-nitrogen model, SASU needs
another 483 simulation years to obtain the steady
states of all pools (gray arrow in Figure 14.3) after
the analytical calculation. This means that the SASU
method saved about 92.4% and 86.6% of the computational
time for spin-up of the global carbon-only
and coupled carbon-nitrogen models to
steady states, respectively, in the case of the CABLE
model (Figure 14.3).

Figure 14.3. Dynamics of the global mean passive soil carbon
pool in carbon-only (filled black diamond) and coupled
carbon-nitrogen (filled gray circle) simulations with the traditional
method and the SASU framework (dotted lines) using
the CABLE model. The arrows and open symbols show the
times and values estimated with the SASU method for carbon-only
(black) and coupled carbon-nitrogen (gray)
simulations. (Adapted from Xia et al. (2012).)

With the traditional spin-up method, the passive 
SOM pool continued to decrease after it reached 
the steady-state criterion (Figure 14.3). Because 
of the slow turnover rate of this pool, thousands 
of additional simulation years are needed for the 
traditional method to reach a steady state of the 
passive SOM pool. In contrast, the analytical solu- 
tion obtained by the SASU method allowed this 
steady state to be accurately predicted after only 
200 simulation years (Figure 14.3). 

The results of this exercise using CABLE show
that SASU can reduce the computational cost of
biogeochemical model spin-up by around 90%. That
means spin-up with the SASU method can be
around ten times as fast as the traditional method.
The computational efficiency of the SASU
method is higher than that of the accelerated decomposition
method reported by Thornton and
Rosenbloom (2005) for site-level spin-up in an
evergreen needle-leaf forest. A similar analytic
spin-up method developed by Lardy et al. (2011)
accelerates spin-up of a pasture model (PaSim)
by up to 20 times.

The SASU method can be easily implemented
for biogeochemical models at site, regional, and
global scales once a biogeochemical model is
converted to a matrix model, which enables analytical
calculation of steady states of carbon and nutrient
pools in combination with the initial and final spin-ups.


SUGGESTED READING 


Xia J, Luo Y, Wang Y-P, Weng E, Hararuk O. (2012). A
semi-analytical solution to accelerate spin-up of a
coupled carbon and nitrogen land model to steady
state. Geoscientific Model Development, 5:1259-1271.


QUIZZES 


Select one option from the given answers


1. Which statement about spin-up is NOT true?

a. Spin-up provides an initial state for model
simulation.
b. Spin-up uses atmospheric CO2 concentration
at the pre-industrial level.
c. Spin-up can speed up the historical
simulation.
d. Spin-up in land biogeochemical models aims
for the steady state.


2. The motivation to develop different spin-up
approaches in land biogeochemical models is to

a. improve the simulation results.
b. improve the computational efficiency.
c. improve model structure.
d. improve model diagnostic capacity.


3. Semi-analytic spin-up (SASU) in CABLE calculates
the steady state of carbon storage by setting

a. carbon storage to zero.
b. carbon influx to zero.
c. carbon influx equal to efflux.
d. carbon efflux to zero.


4. Which statement about CABLE spin-up efficiency
is NOT correct?

a. The CABLE spin-up efficiency for the carbon-only
model is higher than for the carbon-nitrogen
coupled model.
b. SASU can be up to ten times as fast as traditional
spin-up.
c. In the coupled carbon-nitrogen model, SASU
does not need the final spin-up after the analytical
calculation.
d. With the traditional spin-up method, the
passive SOM pool continues to decrease after
it reaches the steady-state criterion.
To understand carbon flows through terrestrial
ecosystems, it is very important to use metrics to 
quantify the time carbon spends in the entire sys- 
tem and in particular compartments. In this chap- 
ter, we introduce the concepts of age and transit 
time as two fundamental metrics that character- 
ize the speed at which carbon flows through eco- 
systems. Age is defined as the time carbon atoms 
spend in an ecosystem, from when they enter 
through photosynthesis until they are observed in 
a particular compartment. Transit time is defined 
as the time carbon atoms require to pass through 
the entire ecosystem, from the time they enter 
through photosynthesis until they are lost in gas, 
liquid, or solid form. We review here mathemati- 
cal formulas for computing age and transit time in 
compartmental systems, distinguishing between 
formulas for autonomous systems in equilibrium 
and nonautonomous systems moving along a 
known trajectory. 


INTRODUCTION 


One of the advantages of representing models in 
the compact form of compartmental systems is that 
we can derive system-level diagnostics that help us
better understand system dynamics. Differences in
process representations, parameterizations, or size 
of compartments required to represent a system, 
can be compared using simple aggregated metrics 
at the level of the entire system. 

Two important system-level diagnostics for 
describing compartmental systems are the con- 
cepts of system age and transit (residence) time 
(Bolin and Rodhe 1973; Sierra et al. 2017). We 
define system age as the age of all atoms or particles 
inside the system, from the time they entered, $t_e$,
until the time of observation t. Transit time is defined 
as the average time required for atoms or particles 
to traverse the system from their arrival time until 
they leave in the output flux. In other words, sys- 
tem age characterizes the age structure of all the 
atoms or particles in the system, while transit time 
characterizes the age structure of all atoms or par- 
ticles in the output flux (Figure 15.1). 

It is also possible to characterize the age struc- 
ture of the atoms or particles inside each pool 
or compartment. We define pool age as the time 
elapsed since the atoms or particles entered the 
system until the time of observation t inside a pool 
i (Figure 15.1). Therefore, the system age is the 
aggregated result of the pool ages for all pools. 


Figure 15.1. Graphical representation of the concepts of system age, transit time, and pool age. Mass entering a compartmental
system can be conceptualized as being composed of small particles or atoms, each of them with a ‘clock’ that measures the time
they have been in the system since they entered. All particles in the input fluxes have an age of zero. If we collect all particles
inside the system at any given time and organize this information as a distribution of ages, we obtain the system age distribution
of particles inside the system. If we collect the particles inside a specific pool and organize particles according to their age, we
obtain the pool age distribution. Collecting particles in the output flux and organizing this information as a distribution of ages
provides the transit time distribution. Figure extracted from Sierra et al. (2017).
[Diagram labels: $t_e$, particle enters reservoir; t, now; age, $t - t_e$; output flux.]


Since the age and transit time concepts are 
defined for all individual atoms or particles inside 
a system, we can also think about them in terms of 
distributions that quantify the proportional distri- 
bution of the mass in age classes. Therefore, these 
distributions can be characterized by statistics such 
as the mean, standard deviation, and quantiles 
such as the median. 

In the following sections, we will introduce
mathematical formulas to quantify age and transit
time distributions for two separate cases: (1)
autonomous systems in equilibrium and (2)
nonautonomous systems. Note that we do not
distinguish between linear and nonlinear systems.
For case (1), linear and nonlinear systems
in equilibrium can be treated similarly, since the
vector of states does not change once the equilibrium
is reached and the system then behaves like
a linear system. For case (2), the formulas rely
on a linearization of the specific trajectory of a
nonlinear system; therefore, we will first introduce
the linearization strategy and then provide the formulas
for the linear nonautonomous case.


AGE AND TRANSIT TIME DISTRIBUTIONS FOR 
AUTONOMOUS SYSTEMS IN EQUILIBRIUM 


The formulas for the age and transit
time distributions of linear autonomous systems in
equilibrium were originally derived in Metzler
and Sierra (2018). For their derivation, we were
able to show that linear compartmental systems are
analogous to absorbing continuous-time Markov
chains. This means that linear compartmental systems
can also be interpreted in a stochastic sense,
with the deterministic system of differential equations
representing the macroscopic behavior of
entire masses, and the absorbing Markov chains
representing the stochastic behavior of individual
atoms or particles with respect to their age. For
details about the stochastic process and the derivation
of the formulas, interested readers can refer to Metzler
and Sierra (2018).

Let's consider linear autonomous systems introduced
in Chapter 7, of the form of equation 7.2,
with an equilibrium point given by equation 7.4.
Let's also consider the 1-norm of a vector, defined
as

$$\|v\| = \sum_{i=1}^{n} |v_i|,$$

which, for vectors with nonnegative elements such as
pool contents, is simply the sum of
all the elements in the vector. We say that the random
variable age a that measures the age of atoms or
particles in the system is distributed according to a
Phase-Type (PH) distribution of the form:

$$f(a) = z^{T}\, e^{a\,B}\, \frac{x^{*}}{\|x^{*}\|}, \quad a \ge 0.$$

Note that this density distribution is composed
of three terms: the vector of fractional release
coefficients $z^{T}$, the matrix exponential of the compartmental
matrix evaluated at each value of age $e^{aB}$,
and the proportional distribution of mass at steady
state $x^{*}/\|x^{*}\|$. Since the fractional release coefficients can be
computed directly from B, we can say that the system
age distribution follows a Phase-Type distribution
with two parameters: the probability vector of
mass at steady state, $\eta = x^{*}/\|x^{*}\|$, and the transition rate matrix
generated by the compartmental matrix. This can
be abbreviated as $a \sim PH(\eta, B)$.
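The phase-type density described above, $f(a) = z^{T} e^{aB} x^{*}/\|x^{*}\|$ with $z^{T} = -\mathbf{1}^{T}B$, can be evaluated numerically. The sketch below assumes a hypothetical two-pool system with made-up rates, and computes the matrix exponential through an eigendecomposition, which is adequate only because this particular B has distinct real eigenvalues; a general-purpose matrix exponential routine would be the safer choice elsewhere.

```python
import numpy as np

# Hypothetical two-pool system dx/dt = u + B x (rates in yr^-1).
B = np.array([[-0.5, 0.0],
              [0.2, -0.01]])
u = np.array([10.0, 0.0])

x_star = -np.linalg.solve(B, u)          # equilibrium pool contents
eta = x_star / x_star.sum()              # proportional mass distribution
z = -B.sum(axis=0)                       # fractional release coefficients

# Matrix exponential via the eigendecomposition B = V diag(w) V^-1.
w, V = np.linalg.eig(B)
V_inv = np.linalg.inv(V)

def expm_aB(a):
    """exp(a * B) for a scalar age a."""
    return (V * np.exp(w * a)) @ V_inv

def age_density(a):
    """System age density f(a) = z^T exp(aB) eta."""
    return float(z @ expm_aB(a) @ eta)

# A probability density should integrate to one (trapezoidal rule).
ages = np.linspace(0.0, 2000.0, 20001)
dens = np.array([age_density(a) for a in ages])
total = float(np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(ages)))
```

The value of `total` being close to one is a useful check that the release coefficients, the compartmental matrix, and the steady-state vector are mutually consistent.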

The mean or expected value $E[a]$ of the system
age distribution can be obtained as:

$$E[a] = -\mathbf{1}^{T} B^{-1} \frac{x^{*}}{\|x^{*}\|},$$

where $\mathbf{1}$ is a vector containing ones, and T is the
transpose operator.

To obtain the pool-age density distribution,
we define first a diagonal matrix with the
steady-state values for each compartment as
$X^{*} := \mathrm{diag}(x_1^{*}, \ldots, x_n^{*})$. The vector-valued function
that returns the age distribution for each pool is
then given by:

$$\mathbf{f}(a) = (X^{*})^{-1}\, e^{a\,B}\, u, \quad a \ge 0,$$

where u is the vector of inputs to the system,
and the mean age for each pool:

$$E[\mathbf{a}] = -(X^{*})^{-1} B^{-1} x^{*}.$$

The density distribution of the random variable
transit time $\tau$ is also Phase-Type distributed, with
the probability vector given by the proportional
distribution of the input flux, $\beta = u/\|u\|$, and the compartmental
matrix as the transition rate matrix; i.e. $\tau \sim PH(\beta, B)$.
It can be obtained as:

$$f(\tau) = z^{T}\, e^{\tau\,B}\, \frac{u}{\|u\|}, \quad \tau \ge 0,$$

with mean transit time given by:

$$E[\tau] = -\mathbf{1}^{T} B^{-1} \frac{u}{\|u\|} = \frac{\|x^{*}\|}{\|u\|}.$$

Notice that the mean transit time is given by
the ratio between the total mass at steady state and
the total input flux.
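The mean values reduce to linear solves: mean system age $-\mathbf{1}^{T}B^{-1}x^{*}/\|x^{*}\|$, mean pool ages $-(X^{*})^{-1}B^{-1}x^{*}$, and mean transit time $\|x^{*}\|/\|u\|$. A minimal sketch, assuming a hypothetical two-pool system with illustrative rates:

```python
import numpy as np

# Hypothetical two-pool system dx/dt = u + B x (rates in yr^-1).
B = np.array([[-0.5, 0.0],
              [0.2, -0.01]])
u = np.array([10.0, 0.0])

x_star = -np.linalg.solve(B, u)             # equilibrium, [20, 400]
B_inv_x = np.linalg.solve(B, x_star)        # B^-1 x*

mean_system_age = -B_inv_x.sum() / x_star.sum()
mean_pool_age = -B_inv_x / x_star           # -(X*)^-1 B^-1 x*, per pool
mean_transit_time = x_star.sum() / u.sum()  # total stock / total input

# Same transit time from the phase-type form -1^T B^-1 beta.
beta = u / u.sum()
mean_transit_time_ph = -np.linalg.solve(B, beta).sum()
```

For these values the pool ages are 2 and 102 years, the mean system age is about 97 years, and the mean transit time is 42 years. The system age exceeds the transit time here because most of the stock sits in the slow pool while most of the output leaves quickly from the fast pool.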


AGE AND TRANSIT TIME DISTRIBUTIONS FOR 
NONAUTONOMOUS SYSTEMS 


We consider now the nonlinear nonautonomous
compartmental system introduced in Chapter 7
of the form of equation 7.9, for which we can
always find a unique numerical solution of the
form $x(t, t_0, x_0)$. To obtain time-dependent age and
transit time distributions for this system, we will
use the known solution to construct an equivalent
linear nonautonomous system with the exact same
solution. Details about the approach are presented
in Metzler, Müller, and Sierra (2018).

Plugging the known solution $x(t) = x(t, t_0, x_0)$
into a new linear version of the system,
we can define a new vector of inputs as
$\tilde{u}(t) := u(x(t), t)$, and a new compartmental
matrix as $\tilde{B}(t) := B(x(t), t)$. Then, we obtain a
linear nonautonomous compartmental system of
the form:


$$\dot{y}(t) = \tilde{u}(t) + \tilde{B}(t)\, y(t), \quad t > t_0, \quad y(t_0) = x_0,$$


which has a unique solution $y(t, t_0, y_0)$ with $y_0 = x_0$. Since we
assume that the original nonlinear system of equation
7.9 also has a unique solution, both solutions
must be identical; i.e. $y(t, t_0, y_0) = x(t, t_0, x_0)$. We
can then use the solution of a nonlinear nonautonomous
system to construct an equivalent linear
nonautonomous compartmental system of the
general form of equation 7.7, which has a general
solution given by equation 7.8.
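The linearization along a known trajectory can be checked numerically. The sketch below assumes a hypothetical one-pool nonlinear system with a density-dependent loss rate and a simple forward-Euler integrator; the functions `u` and `B` and all parameter values are illustrative only.

```python
import numpy as np

# Hypothetical one-pool nonlinear system dx/dt = u(t) + B(x, t) * x,
# with a density-dependent loss rate B(x, t) = -k * x.
k = 1e-4

def u(t):
    return 5.0 + 2.0 * np.sin(t)

def B(x, t):
    return -k * x

dt, n_steps, x0 = 0.01, 20_000, 50.0
t = dt * np.arange(n_steps + 1)

# Step 1: numerical solution of the nonlinear system (forward Euler).
x = np.empty(n_steps + 1)
x[0] = x0
for n in range(n_steps):
    x[n + 1] = x[n] + dt * (u(t[n]) + B(x[n], t[n]) * x[n])

# Step 2: evaluate inputs and rates along the known trajectory. They are
# now functions of time only, which makes the new system linear in y.
u_tilde = u(t)
B_tilde = B(x, t)

# Step 3: solve the linear nonautonomous system from the same initial value.
y = np.empty(n_steps + 1)
y[0] = x0
for n in range(n_steps):
    y[n + 1] = y[n] + dt * (u_tilde[n] + B_tilde[n] * y[n])
```

The two trajectories `x` and `y` coincide, which is the property that allows the age and transit time formulas for linear systems to be applied to a nonlinear model.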
Age Distributions


We assume now that the initial content $x_0$ has an initial
age distribution $f_0(a)$ such that

$$x_0 = \int_0^{\infty} f_0(a)\, da.$$

This initial age distribution is then perturbed by
the time-dependent mass inputs and process rates
of the system, generating a time-dependent age
distribution of the form


$$f(a, t) = g(a, t) + h(a, t),$$


where the term g(a,t) is the time evolution of the
age distribution of the initial mass in the system,
and h(a,t) is the time evolution of the age distribution
of mass that enters the system after $t_0$.

The nonautonomous age distribution of the
initial mass is given by:


$$g(a, t) = \mathbf{1}_{[t - t_0, \infty)}(a) \cdot \Phi(t, t_0) \cdot f_0(a - (t - t_0)),$$


where the indicator function $\mathbf{1}_{S}(a)$ of a set S equals
1 if $a \in S$, or zero otherwise. The state transition
operator $\Phi(t, t_0)$ is defined as in Chapter 7.

The nonautonomous age distribution of the
mass that enters the system after $t_0$ is given by:


$$h(a, t) = \mathbf{1}_{[0, t - t_0)}(a) \cdot \Phi(t, t - a) \cdot u(t - a).$$


To obtain the age distribution of the entire system,
we simply sum the densities over all pools as:


$$\|f(a, t)\| = \sum_{i=1}^{n} f_i(a, t).$$
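The decomposition of the age density into initial mass, $g(a,t)$, and newly entered mass, $h(a,t)$, can be verified on a case with a closed-form solution. The sketch assumes a hypothetical one-pool system $dx/dt = u - kx$ with constant input, for which the state transition operator is $\Phi(t, s) = e^{-k(t-s)}$, and an assumed exponential initial age distribution; all values are illustrative.

```python
import numpy as np

# Hypothetical one-pool system dx/dt = u - k x with constant input, so the
# state transition operator is Phi(t, s) = exp(-k (t - s)).
k, u, t0, x0 = 0.02, 3.0, 0.0, 100.0
lam = 0.05     # assumed exponential shape of the initial age distribution

def f0(a):
    """Initial age density; integrates to x0 over a >= 0."""
    return x0 * lam * np.exp(-lam * a)

def Phi(t, s):
    return np.exp(-k * (t - s))

def age_density(a, t):
    """f(a, t) = g(a, t) + h(a, t) for an array of ages a >= 0."""
    old = a >= (t - t0)                          # indicator of [t - t0, inf)
    shifted = np.where(old, a - (t - t0), 0.0)   # age the mass had at t0
    g = np.where(old, Phi(t, t0) * f0(shifted), 0.0)
    h = np.where(~old, Phi(t, t - a) * u, 0.0)   # indicator of [0, t - t0)
    return g + h

# Integrating the density over all ages should recover the pool content.
t_obs = 30.0
ages = np.linspace(0.0, 430.0, 43001)
dens = age_density(ages, t_obs)
mass_from_ages = float(np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(ages)))

# Closed-form solution of the one-pool system for comparison.
x_exact = x0 * np.exp(-k * t_obs) + (u / k) * (1 - np.exp(-k * t_obs))
```

The agreement between `mass_from_ages` and `x_exact` confirms that the two indicator functions partition the age axis correctly: mass older than $t - t_0$ was present at the start, and younger mass entered afterwards.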


Transit Time Distributions 


To obtain transit time distributions in the nonautonomous
case, it is necessary to distinguish
between the concepts of backward and forward
transit times. The backward transit time is defined
as the age of particles in the output flux at the time
of release from the system, $t_1$. Using the fractional
release coefficients, it is possible to obtain the vector
of outflow rates at time $t_1$ as:


$$z_j(t_1) = -\sum_{i=1}^{n} B_{ij}(t_1), \quad j = 1, \ldots, n.$$


The backward transit time distribution can be
obtained as:


$$f_{BTT}(a, t_1) = z^{T}(t_1) \cdot f(a, t_1), \quad t_1 \ge t_0.$$


Now, the forward transit time is defined as
the age that an atom or particle entering the system
at time $t_e > t_0$ will have when it exits at time
$t_1 = t_e + a$. The forward transit time distribution can
be obtained as:


$$f_{FTT}(a, t_e) = z^{T}(t_e + a) \cdot f(a, t_e + a).$$


Both distributions are tightly connected, with
the forward transit time distribution of particles
entering at time $t_e$ being equal to the backward
transit time distribution of the particles being
released from the system at time $t_1 = t_e + a$, i.e.:


$$f_{FTT}(a, t_e) = f_{BTT}(a, t_1).$$
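The difference between backward and forward transit times shows up as soon as inputs vary in time. The sketch below assumes a hypothetical one-pool system with a constant release rate k, a periodic input, and an initially empty pool, so that only the newly entered mass contributes to the age density; the names and values are illustrative only.

```python
import numpy as np

k, t0 = 0.1, 0.0                       # release rate (yr^-1) and start time

def u(t):
    """Hypothetical periodic input with a 50-year cycle."""
    return 10.0 * (1.0 + 0.8 * np.sin(2 * np.pi * t / 50.0))

def f(a, t):
    """Age density of a pool that is empty at t0, so f(a, t) = h(a, t)."""
    return np.where(a < t - t0, np.exp(-k * a) * u(t - a), 0.0)

def trapezoid(y, x):
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

def mean_backward_transit_time(t1, ages):
    dens = k * f(ages, t1)             # f_BTT(a, t1) = z(t1) * f(a, t1)
    return trapezoid(ages * dens, ages) / trapezoid(dens, ages)

def mean_forward_transit_time(te, ages):
    dens = k * f(ages, te + ages)      # f_FTT(a, te) = z(te + a) * f(a, te + a)
    return trapezoid(ages * dens, ages) / trapezoid(dens, ages)

ages = np.linspace(0.0, 300.0, 30001)
btt = mean_backward_transit_time(400.0, ages)
ftt = mean_forward_transit_time(400.0, ages)
```

With a constant release rate, every cohort leaves according to the same exponential law, so the mean forward transit time equals 1/k regardless of when the cohort entered. The mean backward transit time depends on the input history before the observation time and therefore differs from 1/k in this example.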


FINAL REMARKS 


The compartmental system representation also 
unveils analogies between deterministic systems 
that conserve mass and stochastic systems that
conserve probabilities. This stochastic representa- 
tion can be used to obtain formulas for the age of 
particles or atoms in the compartmental systems. 
With this approach, we derived formulas for the 
age of mass inside a compartmental system (sys- 
tem age), and the age of mass in the output flux 
(transit time). The concept of age can be very valu- 
able to assess how old carbon and biogeochemical 
elements can be in an ecosystem. The concept of 
transit time can be very useful to understand how 
fast biogeochemical elements are processed inside 
an ecosystem, integrating all transfers and trans- 
formations that may take place. 

There are other opportunities to further
explore carbon cycle models in a stochastic setting.
This could be particularly useful for studying,
for example, the macroscopic properties at larger
scales where patterns emerge by the action of
microorganisms acting at microscopic scales. Also,
the compartmental system representation may
help to integrate concepts from other disciplines
such as graph theory or control theory to address
a number of questions not yet explored in
carbon cycle science.


SUGGESTED READING


A general introduction to the concepts of ages and
transit times can be found in Bolin and Rodhe
(1973). More specific results for the derivation of
formulas and the computation of ages and transit
times can be found in Rasmussen et al. (2016) for
the mean of their distributions in nonautonomous
systems, in Metzler and Sierra (2018) for complete
distributions in autonomous systems, and in
Metzler, Müller, and Sierra (2018) for complete
distributions in nonautonomous systems.


QUIZZES 


1. Give examples of systems where the mean tran- 
sit time is higher than the mean system age. 


2. Give examples where the mean system age is 
higher than the mean transit time. 


3. In what type of systems are the mean system 
age, the mean transit time, and the turnover time 
equal? 
This practice aims to help the reader understand
the convergence and efficiency of semi-analytic
spin-up (SASU) for different carbon cycle representations
in a model. By conducting the TECO
spin-up using the native dynamics (ND) and SASU
methods, we explore the convergence of different
spin-up approaches, and the spin-up efficiency
under different carbon input and soil turnover
rates. Under a weakly nonlinear parameterization that
assumes carbon input and soil turnover rates are
functions of carbon pool sizes, the convergence
of the two approaches is verified and the spin-up
efficiencies are compared.


SASU TO IMPROVE COMPUTATIONAL
EFFICIENCY OF SPIN-UP OF BIOGEOCHEMICAL
MODELS


Spin-up in biogeochemical models is an essential
initialization procedure, which sets up the initial
carbon and nitrogen pool sizes at a steady state for
a model (see Chapter 14). When running a complex
biogeochemical model, the spin-up is usually
the most time-consuming procedure, so the efficiency
of spin-up becomes an important topic in
biogeochemical model development.

Semi-analytic spin-up (SASU) is a recently
developed technique to improve spin-up efficiency.
The SASU approach builds on the fact that
a matrix model of the carbon cycle can be
semi-analytically solved to obtain steady-state values of
pool sizes (Xia et al. 2012). Once biogeochemical
models are presented in a matrix form, SASU can
be easily implemented.

So far, SASU has been incorporated into the 
TECO, CABLE, ORCHIDEE, and CLM5 models. By 
implementing SASU, the spin-up computational 
efficiency is improved by 70% to 93% for differ- 
ent models under various configurations, includ- 
ing the most complicated, CLM5. CABLE was the 
first model to implement SASU for its carbon 
(C)-only and carbon-nitrogen (C-N) coupled ver- 
sions. Results showed that SASU saves 92.4% of 
computational cost for C-only CABLE and 86.6% 
for the C-N version, when run at the global scale. 
Site simulations showed higher improvement in 
the spin-up efficiency. 


SPIN-UP IN THE SIMPLIFIED TECO MODEL 


We will apply native dynamics spin-up (ND) and
semi-analytic spin-up (SASU) to a simplified
version of the TECO model in Exercises 1-3 in
CarboTrain. The simplified TECO model was used
in the Unit 3 practice, Chapter 12. It includes 7
pools, constant C input fluxes, and turnover rate
parameters. In Exercise 1, we are going to run ND
and SASU for TECO.


EXERCISE 1 


Run ND (default) spin-up for TECO matrix 
model (Ex 1.1), then change the spin-up 
approach to SASU (Ex 1.2). Learn whether car- 
bon storage driven by the two alternative spin- 
up approaches converges to the same steady 
state. How fast is the SASU spin-up at reaching 
steady state compared with ND? 

Ex 1.1. Follow the steps below to run TECO 
spin-up using ND approach in CarboTrain: 


a. Select Unit 4 

b. Select Exercise 1 

c. Select Default ND 

d. Select Output Folder 

e. Press Run Exercise 

f. Check results in Output Folder. Time-dependent
variation in carbon input,
each pool size, residence time, total
ecosystem carbon storage, total ecosystem
carbon storage capacity, and
carbon storage potential are in output.xls.
Figures are in results.png.
Take note of years for steady state, and
Total_C_10000 (the total carbon at year
10000) and complete the first row of
Table 16.1, below.


Ex 1.2. Follow the steps below to run TECO 
spin-up using the SASU approach in CarboTrain: 


a. Select Unit 4 

b. Select Exercise 1 

c. Select Default SASU 

d. Select Output Folder 

e. Press Run Exercise 

f. Check results in Output Folder. Time-dependent
variation in carbon input,
each pool size, residence time, total
ecosystem carbon storage, total ecosystem
carbon storage capacity, and
carbon storage potential are in output.xls.
Figures are in results.png.
Take note of years for steady state, and
Total_C_10000 and complete the second
row of Table 16.1, below.

QUESTIONS: 


Does carbon storage converge to the same equilib- 
rium in ND versus SASU? Which spin-up approach 
is faster at achieving equilibrium? 


TABLE 16.1
Comparison of results from Exercise 1

Name    Description           C_input     k_passsoil      C_init  C_10000  T_eq
Ex 1.1  Default ND            0.00002245  1.5478 × 10^-6  0
Ex 1.2  Default SASU          0.00002245  1.5478 × 10^-6  0
Ex 1.3  High C_input ND       0.00004490  1.5478 × 10^-6  0
Ex 1.4  High C_input SASU     0.00004490  1.5478 × 10^-6  0
Ex 1.5  High k_passsoil ND    0.00002245  3.0956 × 10^-6  0
Ex 1.6  High k_passsoil SASU  0.00002245  3.0956 × 10^-6  0

Note. C_input = carbon input fluxes (gC m^-2 s^-1); k_passsoil = turnover rate of soil carbon (day^-1); C_init = initial total carbon pool size (gC m^-2); C_10000 = total carbon storage at year 10000 (gC m^-2); T_eq = spin-up convergence time (year).


PRACTICE 4 


SPIN-UP WITH DIFFERENT MODEL 
PARAMETERS 


Previous studies on CABLE spin-up have shown that the C-only and C-N coupled model versions differed in efficiency and equilibrium for both ND and SASU spin-up. We may speculate that differences in parameters and model structures are the causes. In Exercises 1.3-1.6, we consider idealized cases and use the simplified TECO model to study whether and how the parameters influence the spin-up efficiency and equilibrium.

Run ND (default) spin-up for the TECO matrix model with increased C input and passive pool turnover rate (Ex 1.3 and 1.5), then change the spin-up approach to SASU (Ex 1.4 and 1.6). Learn whether carbon storage driven by the two spin-up approaches converges to the same steady state. How is spin-up efficiency affected by different parameters or carbon input? Complete Ex 1.3-1.6 and fill in the remaining rows of Table 16.1.

Ex 1.3. Tune the carbon input and run TECO spin-up using the ND approach in CarboTrain:

a. Select Unit 4

b. Select Exercise 1

c. Select High carbon input ND

d. Select Output Folder

e. Press Open source code

f. Change the input_fluxes to 0.0000449 at line 46 and save the code

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.1.

Ex 1.4. Tune the carbon input and run TECO spin-up using the SASU approach in CarboTrain:

a. Select Unit 4

b. Select Exercise 1

c. Select High carbon input SASU

d. Select Output Folder

e. Press Open source code

f. Change the input_fluxes to 0.0000449 at line 46 and save the code

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.1.


QUESTIONS: 


Compared to the default cases (Ex 1.1 and 1.2), do 
carbon input increases change the spin-up time for 
steady state? Do carbon storages using the two spin- 
up approaches still converge when carbon input 
increases? Are the steady states the same when car- 
bon inputs are higher? Why/why not? 


Ex 1.5. Tune the passive soil turnover 
rate and run TECO spin-up using the ND 
approach in CarboTrain: 


a. Select Unit 4

b. Select Exercise 1

c. Select High K passive ND

d. Select Output Folder

e. Press Open source code

f. Change the last column in variable temp to 0.00000309564 at line 34 and save the code

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.1.

Ex 1.6. Tune the passive soil turnover rate and run TECO spin-up using the SASU approach in CarboTrain:

a. Select Unit 4

b. Select Exercise 1

c. Select High K passive SASU

d. Select Output Folder

e. Press Open source code

f. Change the last column in variable temp to 0.00000309564 at line 34 and save the code

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.1.

QUESTIONS:

Compared to the default cases (Ex 1.1 and 1.2), do increases in passive soil turnover rate change the spin-up time for steady state? By comparing the delta_C in results.png, are results from SASU more stable? Do carbon storages using the two spin-up approaches still converge when passive soil turnover rate increases? Are the steady states the same when soil carbon turnover rate is higher? Why/why not?

SPIN-UP IN A WEAK NONLINEAR SYSTEM

Nonlinearity exists in terrestrial biogeochemical models. For example, canopy photosynthesis (i.e., carbon input) is usually dependent on leaf area, which is a function of leaf carbon pool size. In some models, the soil carbon turnover rate is pool size dependent. The following exercise studies how nonlinearity impacts the spin-up.

EXERCISE 2

Run ND (default) spin-up for the TECO matrix model with foliage nonlinearity and soil turnover rate nonlinearity (Ex 2.1 and 2.3), then change the spin-up approach to SASU (Ex 2.2 and 2.4). Learn whether carbon storage using the two alternative spin-up approaches converges to the same steady state. How long is the spin-up time for a model with nonlinearity in foliage or soil? Complete Ex 2.1-2.4 and enter results to complete Table 16.2.

Ex 2.1. Enable foliage nonlinearity in TECO and then run TECO using the ND spin-up in CarboTrain:

a. Select Unit 4

b. Select Exercise 2

c. Select Nonlinear foliage C with ND

d. Select Output Folder

e. Press Open source code

f. Add the following lines at line 53 (remember to copy the indentation as well):

def input_fluxes(t, y):
    tmp = 0.00000449 + 0.00002245 / 2 * (1 - np.exp(-0.5 * y[0] * 0.008)) / 0.5
    return tmp

The lines added here describe the nonlinear relation between carbon input fluxes and foliage carbon (see Figure 16.1), which is formulated by:

$$C_{input} = f(C_{foliage}) = C_{input,min} + \frac{C_{input,0}}{k_n \cdot LAI_0}\left(1 - e^{-k_n \cdot C_{foliage} \cdot SLA}\right) \qquad (16.1)$$

where $C_{input,min}$ = 0.00000449 gC m^-2 s^-1, $C_{input,0}$ = 0.00002245 gC m^-2 s^-1, $k_n$ = 0.5, SLA = 0.008 m^2 (gC)^-1, $LAI_0$ = 2 m^2 m^-2, and $C_{foliage}$ is the foliage carbon.

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and complete the third row of Table 16.2.

TABLE 16.2
Comparison of results from Exercise 1 and 2

Name    Description                    C_input       k_passsoil      C_init  C_10000  T_eq
Ex 1.1  Default ND                     0.00002245    1.5478 × 10^-6  0
Ex 1.2  Default SASU                   0.00002245    1.5478 × 10^-6  0
Ex 2.1  Nonlinear foliage C with ND    f(C_foliage)  1.5478 × 10^-6  0
Ex 2.2  Nonlinear foliage C with SASU  f(C_foliage)  1.5478 × 10^-6  0
Ex 2.3  Nonlinear soil C with ND       0.00002245    f(C_passsoil)   0
Ex 2.4  Nonlinear soil C with SASU     0.00002245    f(C_passsoil)   0

Note. C_input = carbon input fluxes (gC m^-2 s^-1); k_passsoil = turnover rate of soil carbon (day^-1); C_init = initial total carbon pool size (gC m^-2); C_10000 = total carbon storage at year 10000 (gC m^-2); T_eq = spin-up convergence time (year). f(C_foliage) = nonlinear relation between carbon input flux and foliage carbon (unitless) (see Ex 2.1 or Ex 2.2 for details). f(C_passsoil) = nonlinear relation between passive soil turnover rate and passive soil carbon storage (unitless) (see Ex 2.3 or Ex 2.4 for details).


Figure 16.1. The nonlinear relation between carbon input and foliage carbon. [Plot not reproduced; x-axis: foliage carbon (gC m^-2), 0 to 1200; y-axis: carbon input (gC m^-2 s^-1).]

Ex 2.2. Enable foliage nonlinearity in TECO and then run TECO spin-up using the SASU approach in CarboTrain:

a. Select Unit 4

b. Select Exercise 2

c. Select Nonlinear foliage C with SASU

d. Select Output Folder

e. Press Open source code. Add the following lines at line 53 (remember to copy the indentation as well):

def input_fluxes(t, y):
    tmp = 0.00000449 + 0.00002245 / 2 * (1 - np.exp(-0.5 * y[0] * 0.008)) / 0.5
    return tmp

The lines added here were explained in Step f of Ex 2.1.

f. Press Run Exercise

g. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.2.


QUESTIONS: 


Compared to the default cases (Ex 1.1 and 1.2), 
does foliage nonlinearity increase the spin-up time 
for steady state? Does carbon storage using the two 
spin-up approaches still converge when foliage 
nonlinearity is enabled? Are the steady states the 
same between the two spin-up approaches? 
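The foliage nonlinearity used in Ex 2.1 and Ex 2.2 can be explored on its own before running the spin-up. The sketch below simply evaluates the same expression as the `input_fluxes` lines for a few foliage carbon values; the constant names are ours (the exercise code uses bare numbers), and the sample foliage values are arbitrary.

```python
import numpy as np

# Values taken from the lines added in Ex 2.1; the names are illustrative.
C_INPUT_MIN = 0.00000449   # minimum carbon input (gC m^-2 s^-1)
C_INPUT_0 = 0.00002245     # reference carbon input (gC m^-2 s^-1)
K_N = 0.5                  # coefficient in the exponent
SLA = 0.008                # specific leaf area (m^2 per gC)
LAI_0 = 2.0                # reference leaf area index (m^2 m^-2)


def input_flux(c_foliage):
    """Carbon input as a saturating function of foliage carbon (gC m^-2)."""
    return C_INPUT_MIN + C_INPUT_0 / LAI_0 * (1 - np.exp(-K_N * c_foliage * SLA)) / K_N


foliage = np.array([0.0, 200.0, 400.0, 800.0, 1200.0])   # arbitrary sample points
flux = input_flux(foliage)
upper_limit = C_INPUT_MIN + C_INPUT_0 / (LAI_0 * K_N)     # value approached at large foliage C

for c, f in zip(foliage, flux):
    print(f"foliage C = {c:6.0f} gC m^-2 -> input = {f:.3e} gC m^-2 s^-1")
print(f"saturation limit = {upper_limit:.3e} gC m^-2 s^-1")
```

Because the input saturates as foliage carbon grows, the feedback between the foliage pool and its own input weakens at large pool sizes, which is why the section calls this a weak nonlinearity.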


Ex 2.3. Enable nonlinearity in passive soil 
turnover rate and run TECO using the ND 
spin-up approach in CarboTrain: 


a. Select Unit 4

b. Select Exercise 2

c. Select Nonlinear soil C with ND

d. Select Output Folder

e. Press Open source code

f. Replace line 53 with the following lines (remember to copy the indentation as well):

def fun_K(t, y):
    K[6][6] = temp[6] / 86400 * (y[6] / 10000.0 + 0.1)
    return K
mod = GeneralModel(times, B, A, fun_K, iv_list, input_fluxes)

The lines revised here describe the relation between passive soil turnover rate and passive soil carbon (see Figure 16.2), which is formulated by:

$$k_{passsoil} = f(C_{passsoil}) = k_0 \left(\frac{C_{passsoil}}{10000} + 0.1\right) \qquad (16.2)$$

where $k_0$ = 0.00000154782 day^-1 and $C_{passsoil}$ is the passive soil carbon storage.

Figure 16.2. The relation between passive soil turnover rate and passive soil carbon. [Plot not reproduced; x-axis: passive soil carbon (gC m^-2), 0 to 60000; y-axis: carbon turnover rate (× 10^-6 day^-1).]

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.2.


Ex 2.4. Enable nonlinearity in passive soil 
turnover rate and run TECO using the SASU 
spin-up in CarboTrain: 


a. Select Unit 4

b. Select Exercise 2

c. Select Nonlinear soil C with SASU

d. Select Output Folder

e. Press Open source code

f. Replace line 53 with the following lines (remember to copy the indentation as well):

def fun_K(t, y):
    K[6][6] = temp[6] / 86400 * (y[6] / 10000.0 + 0.1)
    return K
mod = GeneralModel(times, B, A, fun_K, iv_list, input_fluxes)

The lines revised here were explained in Step f of Ex 2.3.

g. Press Run Exercise

h. Check results in Output Folder. Time-dependent variation in carbon input, each pool size, residence time, total ecosystem carbon storage, total ecosystem carbon storage capacity, and carbon storage potential is saved in output.xls. Figures are in results.png. Take note of the years to steady state and Total_C_10000, and enter the results in Table 16.2.


QUESTIONS: 


Compared to the default cases (Ex 1.1 and 1.2), 
does nonlinearity in soil turnover rate increase 
the spin-up time? Does carbon storage using the 
two spin-up approaches still converge when non- 
linearity in soil turnover rate is enabled? Are the 
steady states the same between the two spin-up 
approaches? 
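To see why a pool-size-dependent turnover rate changes the steady state, it can help to isolate the passive pool. The sketch below is a simplification made for illustration, not part of the exercise code: it treats the passive soil pool as a single pool with a constant, hypothetical carbon inflow and the turnover relation used in Ex 2.3 and Ex 2.4, and solves dC/dt = 0 in closed form.

```python
import numpy as np

K0 = 0.00000154782   # baseline passive soil turnover rate (day^-1), as in Ex 2.3


def k_passive(c_pass):
    """Pool-size-dependent turnover rate (day^-1), mirroring the fun_K lines
    without the division by 86400 that converts to per second."""
    return K0 * (c_pass / 10000.0 + 0.1)


def steady_state_pool(inflow):
    """Positive root of inflow = k_passive(C) * C for one isolated pool.

    K0 * (C^2 / 10000 + 0.1 * C) = inflow
    ->  C^2 + 1000 * C - 10000 * inflow / K0 = 0
    """
    return (-1000.0 + np.sqrt(1000.0 ** 2 + 4.0 * 10000.0 * inflow / K0)) / 2.0


inflow = 0.01   # hypothetical carbon flow into the passive pool (gC m^-2 day^-1)
c_ss = steady_state_pool(inflow)
print(f"steady-state passive pool: {c_ss:.1f} gC m^-2")
print(f"turnover rate there:       {k_passive(c_ss):.3e} day^-1")
```

In this isolated-pool setting the steady state no longer scales linearly with the inflow, so a single linear solve is not exact and a semi-analytic scheme has to re-evaluate the turnover rate as the pool changes.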
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This chapter provides an overview of the trace- 
ability framework as a method for identifying the 
key sources of uncertainty in land carbon cycle 
modeling. The analytical framework is built upon 
derivations of the matrix equation introduced in 
Chapter 1. This framework provides a systematic way
to decompose the uncertainty of land carbon cycle 
projections into traceable components. Thus, trace- 
ability analysis can facilitate the understanding of 
carbon cycling, its drivers and controlling processes 
within a model and across models with different 
carbon-cycle structures, external climate forcings, 
and scientific assumptions. This chapter also intro- 
duces several examples to demonstrate the applica- 
tion of traceability analysis to model evaluation. 


A KEY CHALLENGE FOR EARTH SYSTEM 
MODELS: IDENTIFICATION OF UNCERTAINTY 
SOURCES 


Carbon cycle schemes incorporated in Earth 
system models (ESMs) and offline land surface 
models are widely used to simulate terrestrial bio- 
geochemistry and its feedback to climate change. 
The processes underpinning the terrestrial carbon 
(C) cycle constitute one of the most uncertain 
components, and this uncertainty has become a 
bottleneck in Earth system modeling. The model-to-model variation in future projections of the global land C sink remained large from the third assessment report of the Intergovernmental Panel on Climate Change (IPCC), published in 2001, to the fifth report, published in 2013-2014, despite an intervening period of more than a decade of research and model development. The new ESM projections and analysis in Phase 6 of the Coupled Model Intercomparison Project (CMIP6) show that uncertainty in the recent past and potential future evolution of the land carbon cycle, relevant to the assessment and understanding of global climate change, is still a prominent feature in the sixth IPCC report, published in 2021. Thus, one
key challenge for Earth system science is how to 
reduce the large disagreement in carbon cycle pre- 
dictions among models of the terrestrial biosphere 
incorporated in ESMs. 

To tackle this challenge, we need to answer one 
question: why are terrestrial carbon cycle predic- 
tions different among models? In the past three 
decades, numerous model intercomparison projects 
(MIPs) have been established, in part to shed light 
on this question. Although those MIPs have made 
important contributions to model evaluation and 
synthesis, they have been limited by design in their 
ability to explore the sources of model uncertainty. 
Thus, our question can be refined into how we can 
trace model uncertainty back to its sources, such as 
model structures, parameters and external forcings. 
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TRACEABILITY FRAMEWORK: DESIGN AND KEY 
COMPONENTS 


In Chapter 1 we noted that the terrestrial carbon 
cycle has four fundamental properties: (1) pho- 
tosynthesis as the primary carbon influx pathway; 
(2) the transfer of assimilated carbon among dif- 
ferent C pools (e.g., plant litter and soil); (3) the 
rate of transfer of this carbon controlled by the 
donor pool; and (4) the process of carbon transfer 
from the donor pool to the recipient pool, which 
can be described by first-order kinetics. These four 
fundamental properties, with minor variations, 
have emerged in most models using a pool-and- 
flux structure. Thus, the modeled terrestrial carbon 
cycle can be tracked by the equation: 


$$\frac{dX(t)}{dt} = Bu(t) + A\xi(t)KX(t) \qquad (17.1)$$

where $dX(t)/dt$ characterizes the dynamics of terrestrial carbon storage, $u$ is the carbon input from photosynthesis, and $B$ is the vector of partitioning coefficients to live carbon pools (e.g., leaf, wood, and root). Thus, the term $Bu(t)$ represents the partitioning of the assimilated carbon from photosynthesis among different plant carbon pools. The second term, $A\xi(t)KX(t)$, describes the movement and exit rates of carbon atoms along their transfer paths (Xia et al., 2013; Luo et al., 2017). $A$ is a matrix of transfer coefficients that represents the movements of carbon atoms among multiple carbon pools. $K$ represents the exit rates of different carbon pools. $\xi(t)$ modifies the decay rate of different pools by adding influences from environmental factors (temperature, moisture, nutrients, soil texture, and so on). Different models may have different combinations of plant and soil pools and different values in vector $B$ or matrices $A$ and $K$, but they can be unified by this matrix equation. At steady state, terrestrial carbon reservoirs reach their maximum storage capacity and there are no further net carbon exchanges. Therefore, the steady-state land carbon storage ($X_{ss}$) can be solved by letting Equation 17.1 equal 0 (i.e., $dX/dt = 0$):

$$X_{ss} = (-A\xi K)^{-1} B U_{ss} \qquad (17.2)$$


where $X_{ss}$ is a vector containing the steady-state pool sizes of all carbon pools, and $U_{ss}$ is the carbon influx at steady state. The term $(-A\xi K)^{-1}B$ measures the ecosystem carbon residence time ($\tau_E$), which is an important ecosystem property involving multiple processes, including carbon allocation (the $B$ matrix), carbon transfer network (the $A$ matrix), decomposition processes (the $K$ matrix) and modifications from environmental factors (the $\xi$ matrix). Here, net primary productivity (NPP) is treated as the carbon input, and $X_{ss}$ can then be decomposed into NPP and $\tau_E$:

$$X_{ss} = NPP \times \tau_E \qquad (17.3)$$

As mentioned above, $\tau_E$ is determined by four items: $(-A\xi K)^{-1}B$. Here, $A$, $K$ and $B$ are the intrinsic model properties, while $\xi$ represents external influences from environmental factors. Treating $\xi$ and $K$ as diagonal matrices, we rearrange these items and express $\tau_E$ as:

$$\tau_E = \xi^{-1}(-AK)^{-1}B \qquad (17.4)$$

We further define $(-AK)^{-1}B$ as the baseline carbon residence time ($\tau'_E$). Then, $\tau_E$ can be decomposed into $\tau'_E$ and $\xi^{-1}$:

$$\tau_E = \xi^{-1}\tau'_E \qquad (17.5)$$

where $\tau'_E$ is determined by model intrinsic properties, associated with plant traits, soil attributes, number of C pools and how C is transferred among these pools at different cycling rates. As defined above, $\xi$ represents the modifying effects of environmental factors as a fractional value applied multiplicatively to the baseline C residence time. If, as an example, we consider temperature and water availability as the main environmental factors scaling process rates in our system, we may divide $\xi$ into a temperature scalar ($\xi_T$) and a water scalar ($\xi_W$):

$$\xi = \xi_T \xi_W \qquad (17.6)$$

Through the above mathematical rearrangements of Equation 17.2, the modeled land carbon storage at steady state is decomposed into its determinative components as illustrated in Figure 17.1. $X_{ss}$ is first decomposed into NPP and $\tau_E$. Then, $\tau_E$ can be further traced into $\tau'_E$ and $\xi$, while $\xi$ is divided into $\xi_T$ and $\xi_W$. This framework offers a systematic way to decompose the steady-state land carbon cycle in a form that fits the structure of many common land carbon models. Most recently, advances in terrestrial carbon cycle theory (Luo et al., 2017) have extended the capability of this framework to trace transient-state land carbon storage. This is explored further in Chapter 18.

Figure 17.1. The traceability framework for decomposing steady-state terrestrial carbon storage into its traceable components.
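The decomposition in Equations 17.2 to 17.6 can be traced numerically on a toy system. The sketch below is only an illustration under stated assumptions: a hypothetical three-pool model (leaf, litter, soil) with invented coefficients, and a single environmental scalar applied uniformly to all pools, which is a simplification chosen so that the relation between actual and baseline residence time is easy to check by hand.

```python
import numpy as np

# Hypothetical three-pool model; all values are illustrative assumptions.
B = np.array([1.0, 0.0, 0.0])        # allocation of NPP (all to leaf)
K = np.diag([1.0, 0.5, 0.01])        # exit rates of leaf, litter, soil (yr^-1)
A = np.array([[-1.0, 0.0, 0.0],
              [0.8, -1.0, 0.0],
              [0.0, 0.3, -1.0]])     # transfer coefficients, -1 on the diagonal
npp = 500.0                          # carbon influx at steady state (gC m^-2 yr^-1)

xi_T, xi_W = 0.5, 0.8                # temperature and water scalars (Equation 17.6)
xi = np.eye(3) * (xi_T * xi_W)       # same scalar for every pool (simplifying assumption)

# Per-pool contributions to residence time, (-A xi K)^-1 B, and their sum tau_E.
tau_pools = np.linalg.solve(-(A @ xi @ K), B)
tau_E = tau_pools.sum()

# Baseline residence time: the same calculation with the environmental scalar removed.
tau_base = np.linalg.solve(-(A @ K), B).sum()

# Steady-state storage (Equation 17.2) and its traceable components (Equation 17.3).
x_ss = npp * tau_pools

print(f"baseline residence time: {tau_base:.1f} yr")
print(f"actual residence time:   {tau_E:.1f} yr")
print(f"steady-state storage:    {x_ss.sum():.0f} gC m^-2 = NPP x tau_E")
```

With a uniform scalar, the actual residence time is the baseline value divided by the product of the two scalars, so a low scalar (strong environmental limitation) lengthens residence time, as discussed for tundra in Case 1.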


BENEFITS OF TRACEABILITY ANALYSIS FOR 
IDENTIFYING MODEL UNCERTAINTY SOURCES 


Land carbon cycle schemes in most Earth system models can be mathematically represented in the form of Equation 17.1. Thus, the traceability framework can facilitate the evaluation of land carbon cycle models and the ESM frameworks to which they are coupled by identifying the sources of simulation difference within and between models. It should be noted that the uncertainty of land carbon cycling in ESMs not only stems from model structure and parameters, but also is affected by climate outputs from the atmospheric components of ESMs (Ahlström et al., 2017). Below are some cases which have shown the application of traceability analyses for model evaluation. More application cases are summarized in Chapter 18.

JIANYANG XIA

CASE 1: Understanding Terrestrial Carbon Cycle Variations Within a Model

Modern global land-surface models account for a vast array of different processes, making it difficult to trace features of the simulation results to the model's constituent processes and interactions to aid understanding and evaluation. For example, all global land carbon models simulate the spatial variability of ecosystem C storage, but it is often unclear how the geographic distribution of ecosystem C storage across a global map is determined. Xia et al. (2013) introduced the framework of traceability analysis and applied it to the Australian Community Atmosphere Biosphere Land Exchange (CABLE) model to help understand differences in modeled carbon processes among biomes and as influenced by nitrogen processes.

They first estimated ecosystem carbon capacity among different biomes in CABLE and traced it to the ecosystem residence time ($\tau_E$) and carbon influx (i.e., NPP or $U_{ss}$ in Equation 17.2). Results indicated that different biomes showed different patterns. For example, evergreen broadleaf forests have a very high NPP but a short carbon residence time. The tundra biome has a very low NPP but a long carbon residence time. Some barren biomes, such as deserts, have low carbon storage capacity because of low NPP and short carbon residence time (Figure 17.2). Using Equation 17.5 they then further decomposed the ecosystem carbon residence time into baseline carbon residence time and environmental scalars. They found that tundra and evergreen broadleaf forests have similar baseline carbon residence times, but the environmental scalars are much lower in tundra than evergreen broadleaf forests. The lower value of the environmental scalar signifies stronger environmental limitations on decomposition rates of organic carbon, which implies that tundra has much longer actual carbon residence time than evergreen broadleaf forests, even though the baseline residence time is similar. The environmental space of temperature and water scalars among biomes in the CABLE model was then considered, which led to the insight that the spatial difference in environmental control of carbon residence time is mainly driven by the temperature scalar rather than the water scalar in CABLE.

Figure 17.2. Determinants of ecosystem C storage capacity by NPP and residence time. Values of all grid cells are plotted in panel (a). In panel (b), the hyperbolic curves represent constant values of ecosystem carbon storage capacity. ENF: Evergreen needle leaf forest; EBF: Evergreen broadleaf forest; DNF: Deciduous needle leaf forest; DBF: Deciduous broadleaf forest; Shrub: Shrub land; C3G: C3 grassland; C4G: C4 grassland; Barren: barren/sparse vegetation.

Because the CABLE model can be applied with or without carbon-nitrogen coupling enabled, the traceability analysis can be used to evaluate how the incorporation of nitrogen cycle processes can affect carbon cycling within the model. The analysis showed that incorporating the nitrogen cycle into CABLE reduces ecosystem carbon storage capacity in all biomes in comparison with the carbon-only model. Specifically, the CABLE model simulates lower NPP in woody biomes but shorter carbon residence time for non-woody biomes when the nitrogen cycle is switched on (Xia et al., 2013).

CASE 2: Intermodel Comparisons of Terrestrial Carbon Cycle Simulations

Although model intercomparison projects have shown that the ensemble means of multiple models fit data well, the intermodel difference in carbon cycle pools and fluxes is usually large. To better understand the sources of variations in modeled carbon storage capacity among models, Rafique et al. (2016) used the traceability framework to compare two land carbon cycle models, CLM-CASA' and CABLE. As shown in Figure 9.2, CABLE
and CLM-CASA' each showed a distinctive structure of the ecosystem carbon cycle, including the number of carbon pools, NPP partitioning coefficients, decomposition rates and carbon transfer coefficients among different pools, which resulted in different baseline carbon residence times. Due to more NPP partitioning into roots and wood, the baseline carbon residence time in CABLE was longer than in CLM-CASA'. Despite the difference in model structure, the traceability analysis showed that CABLE and CLM-CASA' simulate similar global mean soil carbon storage capacity. This is because CABLE has lower NPP but longer carbon residence time compared to CLM-CASA'. The longer carbon residence time in CABLE mainly results from the baseline carbon residence time rather than the environmental scalars.

Overall, this case indicated that the differences between the two models were primarily due to parameter settings related to photosynthesis, carbon input, baseline residence times, and environmental conditions. The application of traceability analysis to intermodel comparisons is useful for developing a model with different versions or interpreting the different predictions of land carbon cycle by different models under the same climate forcings.

CASE 3: Evaluating Impacts of External Forcings on Terrestrial Carbon Cycle Simulations

One important uncertainty source in land carbon cycle studies stems from the differences in baseline and projected climate fields generated by different climate and Earth system models. When a land carbon model is forced by output from different climate models, the resultant predictions of the land carbon cycle also differ, a source of uncertainty propagating through the carbon cycle model from the forcing climate model. By applying traceability analysis on the LPJ-GUESS model when forced with 13 different climate datasets from CMIP5 general circulation models, Ahlström et al. (2015) studied the impact of climate model uncertainty on the land carbon cycle. LPJ-GUESS is a global dynamic vegetation-ecosystem model that is based on detailed representation of vegetation structure, demography, and resource competition. To understand how ecosystems around the world respond to future projections of atmospheric CO2 concentrations and climate, transient and steady-state simulations were performed forced by output fields from different climate models under the RCP8.5 radiative forcing scenario. The authors then quantified the relative contributions of the three groups of processes (NPP, vegetation dynamics and turnover, and soil decomposition) to future carbon uptake uncertainties. They achieved this by fitting the traceability framework as an emulator to each of the 13 LPJ-GUESS simulations. Because the emulator has a common structure, it was possible to 'exchange' the key carbon cycle processes among the 13 simulations, allowing the importance of each in explaining the variability among simulations to be derived (Figure 17.3). Further details about the model and simulations can be found in Ahlström et al. (2015). Since there is only one carbon model involved in this study, we can be sure that all the differences in the carbon cycle projections stem from the external climate forcing. The results showed that NPP, vegetation turnover, and soil decomposition rate respectively explain 49%, 17%, and 33% of the uncertainty in carbon uptake by terrestrial ecosystems globally under the RCP8.5 future scenario (Figure 17.3c).

Figure 17.3. Emulator performance and example of experimental design. (a) Comparison between global ecosystem carbon stock ($C_{eco}$) at steady state as simulated by LPJ-GUESS and emulator solution of global $C_{eco}$ at steady state for the 13 simulations and emulator solutions. The small deviations from the 1:1 relationship (black line) are mainly due to the internal variability in LPJ-GUESS where a true steady state is never reached, in contrast to the emulator, which solves the steady-state conditions analytically. (b) Illustration of experimental design. Gray curves are time trajectories of $C_{eco}$ from transient simulations with LPJ-GUESS; circles denote corresponding emulator-computed steady-state values of $C_{eco}$. (c) Global partitioning of steady-state $C_{eco}$ uncertainties. From this panel, we can see that NPP is the largest contributor to the difference in modeled land carbon uptake, accounting for almost 50% of variability among simulations with different forcings. Figure reproduced from Ahlström et al. (2015). [Plot not reproduced; legend lists the forcing climate models CanESM2, CCSM4, CSIRO-Mk, EC-EARTH, GFDL-ESM2M, GISS-E2-R, HadGEM2-ES, INM-CM4, IPSL-CM5A-LR, MIROC5, MPI-ESM-LR, MRI-CGCM3, and NorESM1-M; panel (a) x-axis: $C_{eco}$ LPJ-GUESS (Pg C), 2000 to 2400; panel (b) x-axis: year, 1850 to 2100.]


The traceability framework could be applied in
a similar way to analyze the carbon cycle impacts
of other types of external forcings, such as
different CO₂ scenarios, disturbance regimes, and
land use/cover changes. As shown in Ahlström et al.
(2015), this approach can help to diagnose which
processes are most important in determining the
model uncertainty under given external forcings.
This in turn can identify processes for priority
attention in evaluating, revising or improving the
carbon cycle model.


CASE 4: Assessment of New Processes in
Land Carbon Models


Although the integration of more process detail
into a model can increase its utility, it also
tends to increase the number of interacting
processes and feedbacks influencing the model
output, and the relationship of the model output
to the model forcing data. As a result, it
can become more difficult to understand or
evaluate how the newly incorporated processes
influence the behavior and performance of the
model. For example, recognizing the important
role that nitrogen (N) availability plays for the
dynamics of the world's ecosystems, many current
models that originally included only a carbon
cycle have been enhanced to incorporate a
nitrogen cycle that interacts with the model's
carbon processes and state. The availability of
nitrogen can strongly affect both ecosystem
carbon input and mean carbon residence time.
Nevertheless, the detailed structure of the C-N
coupling scheme varies greatly among different
models. How these diverse representations of 
C-N interactions affect carbon cycle modeling 
remains unclear. Thus, Du et al. (2018) incor- 
porated three different C-N coupling schemes, 
derived from the TECO-CN, CLM4.5, and 
O-CN models, into the carbon-only version of 
the Terrestrial ECOsystem (TECO) model and 
then used the traceability framework to evalu- 
ate their impacts on the carbon cycle. The three 
C-N coupled frameworks are different in many 
aspects, including C:N stoichiometry in plants 
and the soil, plant N uptake strategies, down- 
regulation of photosynthesis under N deficit, 
and the pathways of N acquisition. 

As shown in Figure 17.4a, each N process
is simulated with different assumptions by
different models. The results showed that each of
the integrated C-N coupling schemes reduced
the carbon storage capacity compared with
the carbon-only version of TECO. However, the
magnitude of the reduction varies among the
three schemes, i.e., -23%, -30%, and -54%
for TECO-CN, CLM4.5 and O-CN, respectively.
The reduced carbon storage capacity was driven
by changes in NPP (-29%, -15%, and -45%)
and mean C residence time (+9%, -17%, and
-17%) (Figure 17.4b). The differences in these
results for different N cycle implementations in
TECO indicate that adding interactive nitrogen
dynamics to a carbon cycle model could generate
new uncertainty sources for carbon cycle-climate
feedbacks. This example illustrates that
the traceability analysis can improve our
understanding of the impacts of newly incorporated
processes on an existing carbon cycle model.

Figure 17.4. Schematic diagram of the terrestrial ecosystem carbon and nitrogen coupling model and its traceable
components under different C-N coupling schemes. (a) The major carbon and nitrogen pool-and-flux structure in a
terrestrial ecosystem, with alternative assumptions of the N processes represented in the SM1 (TECO-CN), SM2 (CLM4.5),
and SM3 (O-CN) C-N coupling schemes. Light blue arrows indicate C-cycle processes and red arrows show N-cycle
processes. Met./Str. litter: metabolic and/or structural litter; SOM: soil organic matter. *N fixation is set as an option
when plant N uptake is not enough for growth in terms of C investment in SM1, but goes directly to the soil mineral N
pool in SM2 and SM3. (b) Simulation of annual ecosystem carbon storage capacity for 1996 to 2006 at Duke Forest by
carbon influx (NPP, x-axis) and ecosystem residence time ($\tau_E$, y-axis) in the TECO model framework with three C-N
coupling schemes (SM1, SM2, and SM3) and in the TECO C-only model (C). The inset panel in the bottom left
corner shows $\tau_E$ in SM1, SM2, SM3, and the C-only model; the top right panel shows the mean ecosystem carbon
storage simulated among SM1, SM2, SM3, and the C-only model; the bottom right panel shows the relative change in
the simulated NPP and $\tau_E$ among the three schemes compared with the C-only model.


CASE 5: Accelerating the Pace of Model
Evaluation Via Online Tools


Increasing model complexity not only leads to
a large divergence in simulations of the land
carbon cycle by different models, but also tends
to increase the computational cost of
model runs, which can become a bottleneck
for model evaluations. As illustrated above, the
traceability framework allows a carbon cycle
model to be simplified and generalized into
several traceable components, which can be
further decomposed to quantify the structural
sources of the uncertainty built into the full
version of the model. Thus, an online traceability
analysis system for model evaluation
(TraceME) was built to accelerate the pace of
model evaluation of the land carbon cycle, with
Earth system models in mind.

The TraceME system is a cloud-based platform
(Zhou et al., 2021). It provides user-friendly
interfaces and a scientific workflow to
automatically perform the model evaluation.
On the website (http://www.traceme.org.cn/,
last access: October 2020), the user can select
the data of interest and submit the task through
the browser to complete the traceability analysis.
The collaborative framework of TraceME
provides a convenient data-sharing platform
for all users to filter data of interest from the
entire system for traceability analysis. Once a
task is requested through a web browser, the
scientific workflow of TraceME is triggered
and executes the corresponding processes,
such as data preprocessing, traceability
analysis, and evaluation. The submitted data
are systematically decomposed into traceable
components for quantifying the variance
contributions of these components to land
carbon dynamics. After the submitted task is
completed, TraceME provides a visual interface
to show and download the results in the
forms of figures and Network Common Data
Form (NetCDF) files. These files can be
used to perform further analysis. TraceME is a
convenient tool to evaluate models based on
the traceability analysis framework. It has, for
example, been applied to evaluate land carbon
dynamics in CMIP6 Earth system models (Zhou
et al., 2021).


SUMMARY


The traceability analysis provides a relatively new
approach to the evaluation of model uncertainties
impacting simulations of the terrestrial carbon
cycle. The framework builds on the fundamental
properties of the terrestrial carbon cycle, which are
reflected broadly by different models despite
differences in structure and process detail. Equation
17.1 provides the theoretical basis for the
traceability analysis. The application cases described
illustrate how traceability analysis can benefit the
understanding of variations in terrestrial carbon
cycling within a model, the intercomparison of
terrestrial carbon cycling among models, evaluation
of the contributions of external forcings to
carbon-cycle uncertainty, the assessment of newly
incorporated processes in carbon cycle models,
and the development of online tools for quick
and consistent model evaluation. The application
cases presented in this chapter mainly focus on the
steady-state ecosystem C storage. In Chapter 18 we
will explore how the traceability analysis can be
adapted to apply to the transient dynamics of
the land carbon cycle in models.

The traceability framework is developed for
the terrestrial carbon cycle, but it can be extended
to include nutrient and water processes. Further
applications include the integration of benchmark
analysis (see Chapter 19) with traceability
analysis, the connection between structurally
traceable components and model parameters, and the
application of traceability analysis to understand
climate-carbon cycle feedbacks in coupled
land-atmosphere simulations with Earth system models.
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QUIZZES 


1. In the steady-state traceability analysis, which 
variable is first decomposed into NPP and car- 
bon residence time? 

a. Soil C stock 

b. Vegetation biomass 

c. Ecosystem total C stock 

d. Ecosystem C storage capacity 


2. When the nitrogen cycle is incorporated into a 
carbon cycle model, which components in the 
terrestrial carbon cycle will be changed? What 
happened in the CABLE model? 


3. Which model has the longer ecosystem carbon 
residence time, CABLE or CLM-CASA’? Why? 


4. Describe the external forcings which can con- 
tribute to the large model uncertainty of the ter- 
restrial carbon cycle. 


5. How do TECO-CN, CLM4.5, and O-CN models 
differ in their approach to simulating the cou- 
pling between the terrestrial carbon and nitro- 
gen cycles? How do these differences impact 
carbon cycle dynamics in the TECO model? 
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This chapter introduces the traceability analysis of 
transient carbon storage, which is a modification 
of the original traceability framework that relies on 
an assumption of steady state. Transient traceability 
analysis is particularly useful to address the origin 
and drivers of carbon flow of a system in a state 
of transition towards a new steady state — a tran- 
sient system. We will illustrate how the transient 
traceability can be applied to address scientific 
questions with two examples. One tracks the
differences in modeled carbon storage between two
forest ecosystems; the second compares and contrasts
outcomes of a set of Model Intercomparison
Projects (MIPs) encompassing multiple land car- 
bon models. 


INTRODUCTION 


Simulations by Earth system models in the Coupled
Model Intercomparison Project Phase 5 (CMIP5)
showed that differences among the models entail
large uncertainty in land C storage (Jones et al.
2013). The spread of simulated future land C change
across the models is even greater than the spread
across the four radiative forcing scenarios, when
the ensemble averages for each scenario are
compared. Arora et al. (2020) compared model results
from two CMIP phases (CMIP6 vs. CMIP5) and
found that the model mean values of the
carbon-concentration and carbon-climate feedback
parameters and their multi-model spread under a 1% per
year CO₂ increase experiment have not changed
significantly across the two CMIP phases for both land
and ocean. Moving forward, a big challenge is how
to understand and reduce the uncertainty across
models to achieve more reliable predictions.
Although land C models have become increas- 
ingly complex in recent decades with more and 
more processes incorporated, most current mod- 
els have the same theoretical foundation and there- 
fore share some of the same general properties, as 
described in chapters 1 and 2. These shared theo- 
retical foundations and properties enable many 
land C models to be represented or approximated 
in matrix form, as demonstrated in chapter 5. With 
the matrix representations of the C cycle models, 
we are able to decompose the modeled land C stor- 
age into different traceable components, which is 
the unified diagnostic system for uncertainty anal- 
ysis, an overview of which was provided in chap- 
ter 9. Based on the common properties and matrix 
representations of land C models, Xia et al. (2013) 
developed a traceability framework to decompose 
steady-state C storage into traceable components. 
Chapter 17 describes how the framework could be 
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applied to investigate differences in carbon storage 
across biomes and how differences in model struc- 
ture influence the simulated effects of nitrogen 
cycling and land use change on the carbon cycle. 

The traceability framework in its original form 
depends on a steady-state assumption for ecosys- 
tem carbon stocks. However, due to climate change 
and disturbance history, most ecosystems are not at 
steady state. The non-steady state is called a transient
state, and the challenge is how to trace transient C
storage dynamics. In this chapter, we first intro- 
duce a general framework for transient traceabil- 
ity analysis. Then, we illustrate how the transient 
traceability framework can be applied to address 
scientific questions using two examples. Our first 
example uses transient traceability analysis to track 
differences in C storage between two forest eco- 
systems. The second applies transient traceability 
analysis to three model intercomparison projects 
(MIPs) to identify the sources for the uncertainty 
in modeled carbon storage dynamics within each 
MIP and across the three MIPs. 


A TRACEABILITY FRAMEWORK FOR TRANSIENT
LAND CARBON STORAGE DYNAMICS 


In order to realize the traceability of transient C 
storage, Luo et al. (2017) conducted a theoreti- 
cal analysis to extend the steady-state traceability 
framework to work for systems in a transient state. 
The key was to add another term called C stor- 
age potential. It was introduced in chapter 9 and 
is further discussed below. The core equation for 
traceability of transient C storage is from Luo et al. 
(2017) and is as follows:


$$X(t) = (-A\xi(t)K)^{-1} B(t)u(t) - (-A\xi(t)K)^{-1} X'(t) \tag{18.1}$$


where $X(t)$ is individual pool size at time $t$, which
is a vector in a multi-pool model; $A$ is a matrix
of transfer coefficients between C pools; $\xi(t)$ is a
diagonal matrix of environmental scalars to reflect
the control of physical and chemical properties,
e.g., temperature, moisture, nutrients, litter
quality, and soil texture, on C cycle processes; $K$ is a
diagonal matrix of exit rates from donor pools,
which encapsulates mortality rates for plant pools
and decomposition coefficients for litter and soil
pools; $B(t)$ is a vector of allocation coefficients of C
input to each pool; $u(t)$ is C input, i.e., NPP or GPP;
and $X'(t)$ is net change of any individual C pool at
time $t$, which is a vector for a multi-pool model.
The sum of $X'$ of all individual C pools
corresponds to net ecosystem production (NEP), or the
negative of net ecosystem exchange (NEE).
In this equation, the inverse of the product
of $-A$, $\xi(t)$ and $K$, i.e., $(-A\xi(t)K)^{-1}$, is named
chasing time, which is a matrix representing
the timescale for the net C pool change to be
redistributed in the network consisting of all C
pools. The product of chasing time and the
allocation coefficient $B$, i.e., $(-A\xi(t)K)^{-1}B(t)$, is the
residence time of individual pools. The product
of the residence time and C input is the
maximum C that individual pools or the whole
ecosystem can store at a time, which is defined as
C storage capacity, $X_c$, that is,
$X_c(t) = (-A\xi(t)K)^{-1}B(t)u(t)$. The second term in
equation 18.1, the product of chasing time and net
C pool change, $(-A\xi(t)K)^{-1}X'(t)$, represents
redistribution of net change of individual C pools
in the network. This redistribution of net C pool
change indicates the potential of an individual pool
or the whole ecosystem to gain or lose C. Therefore,
it is named C storage potential, $X_p$, that is,
$X_p(t) = (-A\xi(t)K)^{-1}X'(t)$. So, we can derive another
equation from the above descriptions as follows:


$$X(t) = X_c(t) - X_p(t) \tag{18.2}$$


This is the overall equation of the transient 
traceability framework of land carbon storage. To 
demonstrate how this transient traceability frame- 
work is a useful tool, we will walk through two 
example applications. The first application uses the 
framework exactly as described above to investi- 
gate the differences in carbon storage between two 
forest ecosystems, Duke Forest and Harvard Forest, 
and analyze the underlying mechanisms that 
explain these differences. The second application 
uses this general framework in combination with 
some other methods to decompose modeled land 
carbon storage in three MIPs: CMIP5, Trends in Net 
Land-Atmosphere Carbon Exchange (TRENDY), 
and Multiscale Synthesis and Terrestrial Model 
Intercomparison Project (MsTMIP). Decomposing 
differences in land carbon storage across models 
within each MIP and between the three MIPs helps 
us to understand why the models simulate dif- 
ferent outcomes, and what features of the model 
structure or parameters may be responsible for 
these differences. 
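
As a minimal numerical sketch of equations 18.1 and 18.2, the following Python snippet decomposes the pool sizes of a small model at a single time step. The three-pool structure, all parameter values, and the function name `decompose` are hypothetical and chosen only for illustration; they are not TECO settings. We also assume the convention that $A$ carries -1 on its diagonal, so that $-A\xi(t)K$ is invertible with positive timescales.

```python
import numpy as np

# Hypothetical three-pool chain (plant -> litter -> soil); all numbers are
# illustrative and are not TECO parameters.
A = np.array([[-1.0,  0.0,  0.0],   # -1 on the diagonal: each pool loses its own exit flux
              [ 1.0, -1.0,  0.0],   # all plant losses enter litter
              [ 0.0,  0.4, -1.0]])  # 40% of litter losses enter soil, the rest is respired
K = np.diag([0.5, 0.8, 0.02])       # exit rates (yr^-1)
xi = np.diag([0.9, 0.7, 0.7])       # environmental scalars at time t
B = np.array([1.0, 0.0, 0.0])       # all C input is allocated to the plant pool
u = 1.2                             # C input (kg C m^-2 yr^-1)
X = np.array([2.0, 1.0, 15.0])      # pool sizes at time t (kg C m^-2)


def decompose(A, xi, K, B, u, X):
    """Split pool sizes X into storage capacity X_c and storage potential X_p."""
    chasing_time = np.linalg.inv(-A @ xi @ K)   # (-A xi K)^-1, in years
    dX_dt = B * u + A @ xi @ K @ X              # net pool change X'(t)
    residence_time = chasing_time @ B           # per-pool residence time (yr)
    X_c = residence_time * u                    # C storage capacity
    X_p = chasing_time @ dX_dt                  # C storage potential
    return chasing_time, residence_time, X_c, X_p


chasing_time, residence_time, X_c, X_p = decompose(A, xi, K, B, u, X)

print("ecosystem residence time (yr):", residence_time.sum())
print("storage capacity X_c:", X_c)
print("storage potential X_p:", X_p)
print("X_c - X_p reproduces X:", np.allclose(X_c - X_p, X))
```

Because $X'(t)$ is computed from the same matrices, the difference $X_c - X_p$ recovers the pool sizes up to floating-point error, which is the identity stated by equation 18.2.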
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TRANSIENT TRACEABILITY ANALYSIS OF 
CARBON STORAGE AT DUKE FOREST AND 
HARVARD FOREST 


In this case study, we will go over an application
of the matrix approach to trace differences
in carbon storage dynamics between Duke Forest
and Harvard Forest. Duke Forest, located in North
Carolina, USA (35°58'41"N, 79°5'39"W), is an
evergreen needleleaf forest, and the dominant tree
species at this site is Pinus taeda (loblolly pine).
Harvard Forest, located in Massachusetts, USA
(42°32'16"N, 72°10'17"W), is a deciduous
broadleaf forest dominated by Quercus rubra (red oak) and
Acer rubrum (red maple). These two study sites have
contrasting ecosystem types, and plenty of measured
data are available at both sites for calibrating
our model.

The framework for transient traceability of land 
C storage dynamics is shown in Figure 18.1. Using 
this framework, we can decompose transient C 
storage into C storage capacity and C storage poten- 
tial. C storage capacity is the product of NPP and C 
residence time. C storage potential is the product 
of chasing time and net C pool change. Further, 
chasing time is jointly determined by environmental
scalars for temperature and precipitation,
transfer coefficients and exit rate. C residence time
is jointly determined by the environmental scalars, 
transfer coefficients, exit rate, and allocation coef- 
ficients. Environmental scalars can be derived from 
climate forcing. 

Figure 18.1. Schematic diagram of the traceability framework to analyze transient carbon storage dynamics of terrestrial
ecosystems. $\xi_W$ and $\xi_T$ are water and temperature scalars, respectively. Dashed lines show the components that
determine chasing time $\tau_{ch}$ (adapted from Jiang et al. 2017).

Figure 18.2. Correlation between direct model output of carbon
storage (X, the sum of all carbon pools) by the Terrestrial
ECOsystem (TECO) model and calculated X by the traceability
framework in Duke and Harvard Forests (adapted from Jiang
et al. 2017).

The model used is the TECO model, which
has been described in chapters 2 and 5. The
procedure for this transient traceability analysis is as
follows. We first calibrate the TECO model with
GPP data for the two sites downloaded from the
AmeriFlux website (available at
http://ameriflux.lbl.gov/). We then run TECO to steady state
by recycling ten years of forcing data from 1850
to 1859. Climate forcing data, including air and
soil temperature, precipitation, photosynthetically
active radiation, vapor-pressure deficit, and
relative humidity, are derived from an offline run of
the Community Land Model 4.5 (CLM4.5) for
both historical (1850-2005) and RCP8.5 future
(2006-2100) simulations. After that, we run the
model in forward simulation mode from 1850 to
2100. We output each component ($X$, $A$, $\xi$, $K$, $B$)
in Figure 18.1 and calculate transient C storage
using equations 18.1 and 18.2. We then verify the
calculated transient C storage. That is, we compare
direct model output of C storage with C storage
calculated by the transient traceability framework.
They are almost identical in both forests (Figure
18.2), confirming that the transient traceability
framework works very well to reproduce the full
model simulations.
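
The spin-up, forward run, and verification steps can be sketched with a toy model. The snippet below is an illustration under stated assumptions, not the TECO code: the three-pool structure, the synthetic forcing series `u` and `xi`, the annual forward-Euler time step, and the use of a single environmental scalar for all pools are all hypothetical simplifications.

```python
import numpy as np

# Hypothetical three-pool model (plant -> litter -> soil); values are illustrative.
A = np.array([[-1.0,  0.0,  0.0],
              [ 1.0, -1.0,  0.0],
              [ 0.0,  0.4, -1.0]])
K_diag = np.array([0.5, 0.8, 0.02])   # exit rates (yr^-1)
B = np.array([1.0, 0.0, 0.0])         # allocation of C input

# Synthetic annual forcing for 1850-2100 (a stand-in for real climate forcing).
years = np.arange(1850, 2101)
n = years.size
trend = np.linspace(0.0, 1.0, n)
u = 1.0 + 0.4 * trend + 0.05 * np.sin(years)          # C input (kg C m^-2 yr^-1)
xi = 0.6 + 0.2 * trend + 0.03 * np.cos(1.7 * years)   # one scalar shared by all pools


def tendency(X, u_t, xi_t):
    """Net pool change X'(t) = B u(t) + A xi(t) K X(t)."""
    return B * u_t + A @ np.diag(xi_t * K_diag) @ X


# Spin-up: recycle the first ten years of forcing until the pools equilibrate.
X = np.zeros(3)
for _ in range(500):
    for t in range(10):
        X = X + tendency(X, u[t], xi[t])

# Forward run: store direct pool sizes and the traceability reconstruction.
direct = np.empty((n, 3))
rebuilt = np.empty((n, 3))
for t in range(n):
    dX = tendency(X, u[t], xi[t])
    chasing_time = np.linalg.inv(-A @ np.diag(xi[t] * K_diag))
    X_c = chasing_time @ B * u[t]     # storage capacity (equation 18.1, first term)
    X_p = chasing_time @ dX           # storage potential (equation 18.1, second term)
    direct[t] = X
    rebuilt[t] = X_c - X_p            # equation 18.2
    X = X + dX                        # annual forward-Euler step

r2 = np.corrcoef(direct.sum(axis=1), rebuilt.sum(axis=1))[0, 1] ** 2
print("max abs difference:", np.abs(direct - rebuilt).max())
print("R^2 between direct and reconstructed total storage:", r2)
```

In this sketch the agreement is exact up to rounding because $X'(t)$ is taken from the same time step that updates the pools; with output from a full model, small mismatches can arise from output frequency and averaging.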

The results for transient C storage, C storage
capacity and C storage potential are shown in
Figure 18.3. The trajectories of transient C storage
($X$, or rather, the sum of the elements [pools] of
the vector $X$), C storage capacity, $X_c$, and C storage
potential, $X_p$, over time are similar between the two
ecosystems, all increasing with time. Moreover, $X$
closely tracks $X_c$ in both ecosystems and $X_p$ only
accounts for a very small proportion of $X$, which
indicates that transient C storage in these two
ecosystems is dominated by the maximum C
storage, i.e., carbon storage capacity ($X_c$), while carbon
storage in response to climate change is relatively
small. The most important difference between the
sites is that Harvard Forest has higher $X$ and $X_c$ than
Duke Forest (Figure 18.3a). Panel b shows the
change of the three variables in the two
ecosystems by the end of 2100, calculated as the average
of the last ten years' C storage minus the average
of the first ten years' values.

The components of transient C storage are shown 
in Figure 18.4. The two components of C storage 
capacity, that is, NPP and C residence time, are shown 
in panels a and b, respectively. NPP is similar between 
the two ecosystems, but residence time shows differ- 
ent trends. In Duke Forest, C residence time increases 
over time, but in Harvard Forest, C residence time 
decreases. Panel c shows the changes of NPP and C 
residence time in these two ecosystems at the end of 
the 21st century compared to 1850. 

Environmental scalars increase in both forests
(Figure 18.4d and e), which signifies steadily
decreasing environmental limitations on C
processes. Allocation coefficients show different
trends between the two ecosystems (Figure 18.4f).
In Duke Forest, $b_1$ (allocation to leaf) and $b_3$
(allocation to root) both decline with time, but $b_2$
(allocation to wood) increases greatly. In Harvard
Forest, allocation to leaves and wood both slightly
increase, but allocation to roots decreases. The
substantial increase in wood allocation at Duke Forest
may explain why C residence time in this
ecosystem increases; wood usually has longer residence
time than leaves and roots. Similarly, panel g shows
the changes of allocation coefficients by the end of
the 21st century in these two ecosystems.
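
The link between wood allocation and residence time can be illustrated numerically. The sketch below assumes a hypothetical four-pool model (leaf, wood, root, soil) with made-up exit rates and allocation vectors; it only shows how the ecosystem residence time, the sum of the elements of $(-A\xi K)^{-1}B$, responds when allocation shifts toward a slowly cycling pool.

```python
import numpy as np

# Hypothetical four-pool model: leaf, wood, root, soil. Illustrative numbers only.
K = np.diag([1.0, 0.03, 0.5, 0.02])   # exit rates (yr^-1); wood turns over slowly
xi = np.eye(4)                        # environmental scalars held at 1 for clarity
A = -np.eye(4)
A[3, :3] = 0.5                        # half of each plant pool's losses enter soil


def ecosystem_residence_time(B):
    """Sum of per-pool residence times, (-A xi K)^-1 B, in years."""
    return np.linalg.solve(-A @ xi @ K, B).sum()


B_early = np.array([0.30, 0.40, 0.30, 0.0])   # allocation to leaf, wood, root, soil
B_late = np.array([0.20, 0.60, 0.20, 0.0])    # more NPP goes to wood

tau_early = ecosystem_residence_time(B_early)
tau_late = ecosystem_residence_time(B_late)
print(f"residence time: {tau_early:.1f} yr -> {tau_late:.1f} yr")
# prints: residence time: 39.2 yr -> 45.6 yr
```

With the transfer matrix and exit rates unchanged, the increase comes entirely from the allocation vector, which is the mechanism suggested for Duke Forest in the text.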

Figure 18.3. Transient total carbon storage ($X$, the sum of all carbon pools), carbon storage capacity ($X_c$), carbon storage
potential ($X_p$) in Duke Forest and Harvard Forest and their changes by the end of the 21st century under the RCP8.5 scenario
(adapted from Jiang et al. 2017).

Figure 18.4. Net primary production (NPP), ecosystem carbon residence times, environmental scalars, and allocation
coefficients of NPP to leaf ($b_1$), wood ($b_2$) and root ($b_3$) in Duke Forest and Harvard Forest, and their changes by the end of the 21st
century (adapted from Jiang et al. 2017).

Figure 18.5. Correlation between net ecosystem production (NEP) and C storage potential ($X_p$) in Duke Forest and
Harvard Forest (adapted from Jiang et al. 2017). Legible regression annotation: $y = 28.12x + 0.55$, $R^2 = 0.791$.

As shown in equation 18.1 and Figure 18.1, C
storage potential, $X_p$, is codetermined by the
chasing time and net C pool change, $X'$, which equates
to NEP (or NEE with the opposite sign) at ecosystem
scale. Figure 18.5 shows the correlation between NEP
and C storage potential in these two ecosystems. The
coefficients of determination, $R^2$, are high in both
Duke Forest (0.80) and Harvard Forest (0.79). This
indicates that $X_p$ is mostly determined by NEP rather
than chasing time. Chasing time, represented by the
slopes of the linear regressions between $X_p$ and NEP,
is an indicator of approximate time needed for
transient C storage to reach C storage capacity.
Chasing [...] years for the changed carbon pool to be
redistributed in the network. In Harvard Forest, this
time is around 28 years. Having shorter chasing time,
$X_p$ in Duke Forest is lower than that in Harvard Forest.

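
Estimating an ecosystem-scale chasing time as a regression slope can be sketched as follows. The data here are synthetic: the NEP series, the noise level, and the prescribed slope of 28 years (chosen only to echo the Harvard Forest value quoted above) are assumptions for illustration and are not model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic annual series; the prescribed slope of 28 yr is an assumption.
nep = rng.normal(0.05, 0.03, size=250)              # NEP (kg C m^-2 yr^-1)
x_p = 28.0 * nep + rng.normal(0.0, 0.3, size=250)   # storage potential (kg C m^-2)

# Least-squares line X_p = slope * NEP + intercept; the slope has units of years.
slope, intercept = np.polyfit(nep, x_p, deg=1)
r_squared = np.corrcoef(nep, x_p)[0, 1] ** 2

print(f"estimated chasing time: {slope:.1f} yr, R^2 = {r_squared:.2f}")
```

Since $X_p$ has units of carbon per area and NEP has units of carbon per area per year, the fitted slope is a timescale, which is why it can be read as an approximate chasing time.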
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To summarize this case study, the transient trace- 
ability framework can decompose modeled tran- 
sient C storage into a few traceable components. 
This helps us to understand the mechanisms for 
modeled C dynamics in response to climate change 
in Duke Forest and Harvard Forest. For example, the 
difference in carbon storage capacity ($X_c$) between
the two ecosystems is mostly caused by the
difference in inherent C residence time. In addition, the
contrasting responses of C residence time to climate 
change between the two ecosystems can be attrib- 
uted to the different responses of allocation of NPP 
to plant parts (leaves, wood, and roots). This appli- 
cation demonstrates that the traceability framework 
can be used to understand how and why different 
ecosystems respond to climate change differently. 
Similarly, it can also be used to address how other 
global change drivers (such as land use change 
and elevated CO₂) affect land C storage dynamics
across ecosystems in simulations with ecosystem 
models. When applied to global land models, it can 
also help investigate the differences across biomes 
under different environmental scenarios. 


TRANSIENT TRACEABILITY ANALYSIS 
OF LAND CARBON STORAGE IN MODEL 
INTERCOMPARISON PROJECTS 


Another important application of the transient
traceability framework is in MIPs, where it can
identify sources of uncertainty and thereby help
improve model development. For example, Zhou
et al. (2018) applied the transient traceability
framework, with a modified method, to diagnose
the causes of uncertainty in modeled global
annual land carbon storage within and across three
MIPs: CMIP5, TRENDY, and MsTMIP.

The transient traceability analysis of carbon
storage at Duke Forest and Harvard Forest
introduced above is called authentic transient
traceability analysis; that is, the modeled differences of C
storage among ecosystems or among models can
be traced back to differences in the respective
components in equation 18.1, and finally to individual
processes or parameters in the models. However,
the application of authentic transient traceability
analysis to MIPs requires much time to figure
out the structures and parameterizations of all
involved models in order to recode each in matrix
form for a thorough model intercomparison.
Because of the challenge in acquiring all the details
of the involved models, Zhou et al. (2018) applied
the transient traceability framework in combination
with another technique, variance decomposition,
to identify the underlying causes of uncertainty
in simulated land carbon storage within and across
three MIPs. This kind of analysis is called post-MIP
transient traceability analysis to distinguish it
from the authentic transient traceability analysis.
Figure 18.6 shows the schematic diagram of this
transient traceability analysis within and among
the three MIPs.

In this post-MIP traceability analysis, Zhou et al.
(2018) found that models differ considerably in the
global annual carbon residence time, NPP, and
carbon storage potential (Figure 18.7a-c). Usually,
within each model, relative year-to-year variation
in carbon residence time is much smaller than
that of NPP. In addition, NPP in those models
with a coupled nitrogen cycle (e.g., BNU-ESM,
CESM1(BGC), and NorESM1-ME in CMIP5, and
CLM4, CLM4VIC, ISAM, and DLEM in MsTMIP)
is lower than that in other ESMs without a coupled
nitrogen cycle. Carbon residence time and NPP
show smaller variations across the nine dynamic
global vegetation models in TRENDY in comparison
to those models in CMIP5 and MsTMIP. As a
result, the variations in carbon storage capacity
in TRENDY are likewise not as large as in CMIP5
and MsTMIP.

Figure 18.6. Schematic diagram of the transient traceability framework by Zhou et al. (2018). © American Meteorological
Society. Used with permission. This framework traces the modeled transient carbon storage dynamics to carbon residence time,
NPP, carbon storage potential, and their source factors.

The global annual carbon storage and carbon 
storage capacity also vary considerably across the 
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Figure 18.7. The 3D model output space (carbon residence time, NPP, and carbon storage potential), and time series of annual 


carbon storage (solid lines) with the shaded outlines indicating the year-to-year fluctuations due to changes in carbon stor- 
age capacity for the models in CMIP5 (a and d), TRENDY (b and e), and MsTMIP (c and f). The points in (a)—(c) represent 
the global annual values for the three variables. The contour lines in (a)—(c) represent the carbon storage capacity. Shading in 


(d)—(f) shows the values of the carbon storage potential for the models (positive above the solid lines, and negative below the 


solid lines). Reproduced from Zhou et al. 2018. © American Meteorological Society. Used with permission. 
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models and the temporal dynamics of carbon stor- 
age and carbon storage capacity of the models are 
highly diverse (Figure 18.7d-f). The large range 
of carbon storage across those models is closely 
related to that of carbon storage capacity. The inter- 
annual trends of carbon storage in the models of 
the three MIPs are mainly affected by the carbon 
storage potential, because the sign and magnitude 
of carbon storage potential determine the direction 


and rate of carbon storage change, respectively. The 
interannual variability of carbon storage in TRENDY 
is much smaller than that in CMIP5 and MsTMIP 
Carbon residence time and NPP are further 
traced to their baseline values and environmental 
scalars. The differences in carbon residence time 
(or NPP) across the models are codetermined by 
the differences in baseline carbon residence time 
(or baseline NPP) and the environmental scalars 
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Figure 18.8. Decomposition of the carbon residence time into the baseline carbon residence time and the environmental scalar 
and decomposition of annual NPP into the baseline NPP and the environmental scalar for CMIPS (a and d), TRENDY (b and e), 
and MsTMIP (c and f). The environmental scalar is a product of the temperature and water scalars, which convert the baseline 
carbon residence time and baseline NPP into their actual values (from Zhou et al. 2018 © American Meteorological Society. 


Used with permission. ). 
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(Figure 18.8). The baseline carbon residence time 
and baseline NPP among the models in the three 
MIPs can differ as much as threefold. In contrast, 
the environmental scalars are more convergent 
among the models for both carbon residence time 
and NPP That indicates that the large differences 
in carbon residence time and NPP across models 
are due mainly to the differences in their baseline 
carbon residence time and baseline NPP, not much 
being caused by environmental scalars. 
Figure 18.9. Variance decomposition of the carbon storage based on global annual data from models in the three MIPs. First,
the variation of the carbon storage X is decomposed into that of the carbon residence time τ_E, NPP, and the carbon storage
potential X_p. Second, variations of the carbon residence time and NPP are decomposed into their baseline values (τ′_E and NPP′)
and the temperature and water scalars. Positive/negative values mean positive/negative contributions of the variables to the
variation of carbon storage (from Zhou et al. 2018 © American Meteorological Society. Used with permission.).

Finally, by adopting a variance decomposition method, Zhou et al. (2018)
quantified the relative contribution of each component to the simulated
global annual carbon storage for each MIP and for all three MIPs together.
The results revealed that variations in transient carbon storage are
dominated by carbon residence time and NPP, whereas carbon storage potential
contributes less than 1% (Figure 18.9). Moreover, the baseline carbon
residence time and baseline NPP contribute more than 90% to the variations
in carbon residence time and NPP, respectively. In contrast, the
contributions of the temperature and water scalars to the variations in
carbon residence time and NPP are both less than 5%. As a consequence, the
variations in simulated transient carbon storage across the models can be
primarily attributed to differences in the models' baseline carbon
residence time and baseline NPP.
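
To make the idea of attributing variance to components concrete, the
following Python sketch splits the across-model variance of carbon storage
capacity (NPP multiplied by residence time) between its two factors. It
uses a simple covariance-based attribution on the log scale, which is an
assumption chosen for illustration and not necessarily the method of Zhou
et al. (2018). The function name and the five model values are
hypothetical.

```python
import numpy as np


def covariance_shares(total, parts):
    """Attribute Var(total) to additive parts via Cov(total, part).

    Assumes total equals the element-wise sum of the parts, so that
    Var(total) equals the sum of Cov(total, part_i) over all parts.
    """
    total = np.asarray(total, dtype=float)
    var_total = np.var(total)
    shares = {}
    for name, values in parts.items():
        values = np.asarray(values, dtype=float)
        cov = np.mean((total - total.mean()) * (values - values.mean()))
        shares[name] = float(cov / var_total)
    return shares


# Hypothetical global values from five models (illustrative numbers only).
npp = np.array([45.0, 60.0, 55.0, 70.0, 50.0])   # Pg C per year
tau = np.array([20.0, 35.0, 28.0, 40.0, 25.0])   # years
capacity = npp * tau                              # Pg C

# On the log scale the product becomes a sum, so the shares add up to one.
shares = covariance_shares(
    np.log(capacity),
    {"NPP": np.log(npp), "residence_time": np.log(tau)},
)
for name, share in shares.items():
    print(f"{name}: {100 * share:.1f}% of the variance in log(capacity)")
```

Working on the log scale keeps the attribution exact for a product of two
factors; other decompositions would distribute shared variance differently.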
The post-MIP approach adopted by Zhou et al. (2018) is a novel approach
that provides an alternative way to understand the causes of uncertainty
among multiple models when authentic traceability analysis cannot be
carried out because of the effort required to obtain detailed information
on the original models. Such post-MIP traceability analysis can be run
automatically on a public platform, TraceME (v1.0), an online traceability
analysis system for model evaluation of land carbon dynamics (Zhou et al.
2021). While post-MIP traceability analysis offers useful insight, it would
be helpful to apply authentic traceability analysis, as in the first case
of this chapter, to MIPs to help identify the specific model components and
assumptions that dominate model uncertainty and to focus attention on the
issues in need of closer scrutiny to improve model behavior.

After the causes of differences in model behavior have been identified,
modelers can then use observational data to determine which models
represent the actual processes more accurately. This is the realm of
benchmark analysis, which is introduced in detail in Chapter 19. In this
way, model projections can be made considerably more realistic.


SUMMARY 


This chapter has demonstrated, with two case studies, how the transient
traceability framework can be applied to address different scientific
questions. This recently developed framework has the potential to be used
more broadly, in ways similar to those overviewed for steady-state
traceability analysis of land carbon storage in Chapter 17. The ultimate
goal of the transient traceability framework is to enhance our
understanding of how terrestrial ecosystems respond to various
environmental changes and to better incorporate such understanding in
models to predict the future status of land carbon storage.


SUGGESTED READING 


Jiang LF, Shi Z, Xia JY, Liang JY, Lu XJ, Wang Y, Luo YQ
(2017) Transient traceability analysis of land carbon
storage dynamics: procedures and its application
to two forest ecosystems. J Adv Model Earth Syst
9:2822-2835.

Zhou S, Liang JY, Lu XJ et al. (2018) Sources of uncer- 
tainty in modeled land carbon storage within and 
across three MIPs: Diagnosis with three new tech- 
niques. J Clim 31:2833-2851. 


QUIZZES 


1. Is transient C storage determined by C storage 
capacity and C storage potential? Why/why not? 


2. Is carbon storage potential always positive? 
Why/why not? 
3. Carbon storage potential is co-determined by:
• Carbon residence time
• Chasing time
• Net C pool change
• NPP


4. What scientific questions do you think the tran- 
sient traceability framework can address? How 
can it be applied to this purpose? 
Tremendous progress has been achieved in the
development of land models and their inclusion 
in Earth system models (ESMs). However, we still 
have very limited knowledge on the performance 
skills of these land models. This chapter intro- 
duces benchmark analysis, which is a procedure 
to measure performance of models against a set 
of defined standards. The benchmark analysis 
includes: (1) defining targeted aspects of model 
performance to be evaluated; (2) testing model 
performance in comparison with a set of bench- 
marks; (3) measuring model performance skill 
through quantitative metrics; and (4) evaluating 
model performance and offering suggestions for 
future model improvement. 


INTRODUCTION 


DOI: 10.1201/9780429155659-24

Over the past decades, tremendous progress has been achieved in the
development of land models and their inclusion in Earth system models
(ESMs). State-of-the-art land models now account for biophysical processes
(exchanges of water and energy) and biogeochemical cycles of carbon,
nitrogen, and trace gases. They also simulate vegetation dynamics and
disturbances. When coupled as components in ESMs, land models now allow
simulation of land-atmosphere biophysical interactions and climate-carbon
feedbacks. These models are now widely used for policy-relevant assessment
of climate change and its impact on ecosystems or terrestrial resources.
However, there is still very limited knowledge of the performance skills of
these land models, especially when embedded in ESMs. Quantifying the
performance skills of land models would promote confidence in their
predictions of future states of ecosystems and climate, and identify those
models whose predictions are more likely to be accurate where ensemble
members diverge.
Model performance has traditionally been evaluated via comparison with
observed data sets. “Validation” by plotting model data side-by-side with
observed data, or computing mismatch metrics such as root-mean-square
error, is traditionally the most common approach to model evaluation
(Oreskes, 2003; Rykiel, 1996; see also Chapter 2). However, a land model
typically simulates hundreds of biophysical, biogeochemical, and ecological
processes at regional and global scales over hundreds of years. It would be
unrealistic to undertake validation of so many processes at all spatial and
temporal scales, even if observations were available. The complex behavior
of these interacting processes can be realistically understood only if we
holistically assess land models and their major components. Benchmark
analysis is an approach that has been recently developed to evaluate the
performance of land models.

Figure 19.1. Schematic diagram of the benchmarking framework for evaluating land models. The framework includes four
major components: (1) defining model aspects to be evaluated, (2) selecting benchmarks as standardized references to test
models, (3) developing a scoring system to measure model performance skills, and (4) stimulating model improvement.
(Adopted from Luo et al., 2012).

Benchmark analysis is a standardized evaluation of a system's performance
against defined reference data (i.e., benchmarks) that can be used to
diagnose the system's strengths and deficiencies for future improvement
(Luo et al., 2012). It has recently been applied to evaluate land models
against observations (Collier et al., 2018). A benchmark analysis has four
elements: (1) targeted aspects of model performance to be evaluated; (2)
benchmarks as defined reference data to evaluate model performance; (3) a
scoring system of metrics to measure relative performance among models; and
(4) evaluated performance of models and future improvement (Figure 19.1).


ASPECTS OF LAND MODELS TO BE EVALUATED


Land models typically simulate many processes. 
Although individual studies may assess only a 
few aspects of model performance, a comprehen- 
sive benchmark analysis is required to evaluate all 
these major components when land models are 
integrated with ESMs. The performance of a model 
should be evaluated for its baseline simulations 
over broad spatial and temporal scales, and mod- 
eled responses of land processes to global change. 

The baseline state for biogeochemical cycles includes simulated global
totals, spatial distributions, and temporal dynamics of gross primary
production, net primary production, vegetation and soil carbon stocks,
ecosystem respiration, litter production, litter mass, and net ecosystem
production. For example, the International Land Model Benchmarking (ILAMB)
project evaluated biomass, burned area, gross primary productivity, leaf
area index, global net ecosystem carbon balance, net ecosystem exchange,
ecosystem respiration, and soil carbon (Collier et al., 2018).

To reliably predict future states of ecosystems under a changing
environment, land models have to realistically simulate responses of land
processes to disturbances and global change. Major global change factors
include rising atmospheric CO₂ concentration, increasing land use and
surface air temperature, altered precipitation amounts and patterns, and
changing nitrogen (N) deposition. The direct effects of these global change
factors are relatively easily benchmarked since we have direct knowledge of
how ecosystems respond to rising atmospheric CO₂ concentration, increasing
temperature, altered precipitation, and changing nitrogen deposition.
However, indirect effects of these factors on ecosystem carbon processes
are not well understood, although many field experiments have been
conducted. Thus, it is more difficult to benchmark model performance in
predicting future states of ecosystems.


REFERENCE DATA SETS AS BENCHMARKS 


A comprehensive benchmarking analysis usu- 
ally uses a set of benchmarks, against which land 
model performance can be evaluated (Table 19.1). 
Benchmarks could consist of direct observations, 
results from manipulative experiments, data-model 
products, or data-derived functional relationships. 
Direct observations and experimental results are 
generally accepted to be the most reliable bench- 
marks for model performance and are typically 
referred to as reference data. Reference data that 
are often used for benchmarking biogeochemical 
cycle models include global data products of gross 
primary production (GPP), net primary produc- 
tion (NPP), soil respiration, ecosystem respiration, 
plant biomass, and soil carbon. When they are used 
in a benchmarking analysis, reference data sets are 
usually assessed and weighted for their degree of 
certainty, scale appropriateness, and overall impor- 
tance of the constraint or process to model pre- 
dictions (Collier et al., 2018). The ILAMB project 
evaluates eight variables using a variety of refer- 
ence data as listed in Table 19.1. 

TABLE 19.1

Reference data sets used to measure ecosystem and carbon cycle performance

Variables                Reference data sets                   Description
Biomass                  Tropical (Saatchi et al., 2011)       forest carbon stocks in tropical regions across three continents
                         NBCD2000 (Kellndorfer et al., 2013)   aboveground biomass and carbon baseline data in North America
                         USForest (Blackard et al., 2008)      U.S. forest biomass
Burned area              GFED4S (Giglio et al., 2010)          variability and long-term trends in burned area
GPP                      Fluxnet (Lasslop et al., 2010)        net ecosystem exchange, photosynthesis, and respiration
Leaf area index          AVHRR (Myneni et al., 1997)           global land cover, LAI and FPAR
                         MODIS (De Kauwe et al., 2011)         leaf area index product for a region of mixed coniferous forest
Global NECB              GCP (Le Quéré et al., 2016)           global carbon budget 2016
Net ecosystem exchange   Fluxnet (Lasslop et al., 2010)        net ecosystem exchange, photosynthesis, and respiration
Ecosystem respiration    Fluxnet (Lasslop et al., 2010)        net ecosystem exchange, photosynthesis, and respiration
Soil carbon              HWSD (Todd-Brown et al., 2013)        Harmonized World Soil Data
                         NCSCDV22 (Hugelius et al., 2013)      organic carbon storage to 3 m depth in soils of the northern circumpolar permafrost region

NECB = net ecosystem carbon balance

Land models can also be evaluated on their simulated variable-to-variable
relationships in comparison with relationships in observations. For
example, model representations of the relationships that GPP exhibits with
precipitation, evapotranspiration, and temperature are often assessed. Such
variable-to-variable relationships are quantified over a time period from
reference data sets and used as benchmarks for the relationships diagnosed
in models. This approach is particularly effective for understanding the
consistency between the observed and simulated sensitivity of ecosystem
responses to climate change.
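
A variable-to-variable comparison of this kind can be sketched in Python as
follows. The sketch bins GPP by annual precipitation for a reference data
set and for a model, then turns the mismatch between the two binned curves
into a score. The binning, the exponential score mapping, the function
names, and the synthetic data are all assumptions made for illustration;
they are not taken from the ILAMB code.

```python
import numpy as np


def binned_relationship(x, y, bin_edges):
    """Mean of y within each bin of x; NaN where a bin is empty."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.digitize(x, bin_edges) - 1           # bin index of each sample
    means = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        in_bin = idx == b
        if in_bin.any():
            means[b] = y[in_bin].mean()
    return means


def relationship_score(ref_curve, mod_curve):
    """Map the relative RMSE between two binned curves to a score in (0, 1]."""
    ok = ~np.isnan(ref_curve) & ~np.isnan(mod_curve)
    rmse = np.sqrt(np.mean((mod_curve[ok] - ref_curve[ok]) ** 2))
    relative_error = rmse / np.mean(np.abs(ref_curve[ok]))
    return float(np.exp(-relative_error))


# Synthetic grid-cell values (illustrative only).
precip = np.linspace(50.0, 1450.0, 15)            # mm per year
gpp_ref = 0.002 * precip                          # reference GPP
gpp_mod = 0.0025 * precip                         # model responds too strongly

edges = np.array([0.0, 500.0, 1000.0, 1500.0])
ref_curve = binned_relationship(precip, gpp_ref, edges)
mod_curve = binned_relationship(precip, gpp_mod, edges)
score = relationship_score(ref_curve, mod_curve)
print(ref_curve, mod_curve, round(score, 3))
```

A model whose binned curve coincides with the reference curve receives a
score of 1, and the score decreases as the curves diverge.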


BENCHMARKING METRICS 


A comprehensive benchmarking study usually 
uses a suite of metrics across several variables to 
holistically assess model performance at the rel- 
evant spatial and temporal scales. Many statistical 
measures are available to quantify mismatches 
between multiple modeled and observed variables. 
Five metrics were developed for ILAMB to evaluate
model performance. These metrics measure bias,
root-mean-square error (RMSE), phase shift,
interannual variability, and spatial distributions
(Collier et al., 2018).

The bias measures differences between the 
mean value of the reference data and that of the 
model over the same time period and the same 
spatial area. For example, the bias of gross primary 
productivity between the reference data and the 
model (e.g., Community Land Model version 4.5, 
CLM4.5) is calculated between their respective 
means in each grid cell where both reference data 
and modeled values are available. To account for 
the bias due to the variability at any given spatial 
location, the bias is nondimensionalized as a rela- 
tive error to measure the bias score. 
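
As a rough illustration of this step, the Python sketch below computes the
bias in each grid cell and nondimensionalizes it by the temporal
variability of the reference data. The exponential mapping from relative
error to score, the function name, and the synthetic data are assumptions
for this sketch; Collier et al. (2018) give the definitions actually used
in ILAMB.

```python
import numpy as np


def bias_and_score(ref, mod):
    """Per-cell bias and a relative-error score for (time, cell) arrays."""
    ref = np.asarray(ref, dtype=float)
    mod = np.asarray(mod, dtype=float)
    ref_mean = ref.mean(axis=0)
    bias = mod.mean(axis=0) - ref_mean
    # Centralized RMS of the reference: its variability about its own mean.
    crms_ref = np.sqrt(((ref - ref_mean) ** 2).mean(axis=0))
    relative_error = np.abs(bias) / crms_ref
    return bias, np.exp(-relative_error)          # assumed score mapping


# Synthetic monthly GPP for two grid cells (illustrative values only).
months = np.arange(12)
seasonal = np.sin(2.0 * np.pi * months / 12.0)
ref_gpp = np.column_stack([2.0 + seasonal, 5.0 + 2.0 * seasonal])
mod_gpp = ref_gpp + np.array([1.0, 0.0])          # cell 0 is biased high

bias, score = bias_and_score(ref_gpp, mod_gpp)
print(bias, score)
```

In this example the unbiased cell scores 1, while the biased cell is
penalized in proportion to how large its bias is relative to the seasonal
variability of the reference data.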

RMSE is computed as the square root of the 
mean square error between modeled values and 
the reference data over a time period. The RMSE 
is normalized by the centralized RMSE of the ref- 
erence data set to get a relative error as a score. 
By scoring the centralized RMSE, the bias is 
removed from the RMSE, allowing the RMSE score 
to be focused on an orthogonal aspect of model 
performance. 
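
The separation of bias from RMSE can be illustrated with a short Python
sketch for a single grid cell. The centralized RMSE is computed after
removing each series' own mean and is then divided by the centralized RMS
of the reference data. The exponential score mapping, the function name,
and the synthetic series are assumptions made for illustration.

```python
import numpy as np


def rmse_and_score(ref, mod):
    """RMSE plus a bias-free score based on the centralized RMSE."""
    ref = np.asarray(ref, dtype=float)
    mod = np.asarray(mod, dtype=float)
    rmse = np.sqrt(np.mean((mod - ref) ** 2))
    ref_anom = ref - ref.mean()
    mod_anom = mod - mod.mean()
    crmse = np.sqrt(np.mean((mod_anom - ref_anom) ** 2))
    crms_ref = np.sqrt(np.mean(ref_anom ** 2))
    return rmse, np.exp(-crmse / crms_ref)        # assumed score mapping


months = np.arange(12)
seasonal = np.sin(2.0 * np.pi * months / 12.0)
ref = 2.0 + seasonal

# A model with a constant offset: large RMSE, but a perfect centralized score.
rmse_offset, score_offset = rmse_and_score(ref, ref + 1.0)

# A model with the right mean but half the seasonal amplitude.
rmse_damped, score_damped = rmse_and_score(ref, 2.0 + 0.5 * seasonal)
print(rmse_offset, score_offset, rmse_damped, score_damped)
```

The constant offset is already penalized by the bias score, so it does not
lower the RMSE score; the damped seasonal cycle does.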

The phase shift is evaluated for the annual 
cycle of many data sets that have monthly variabil- 
ity by comparing the timing of the maximum of 
the annual cycle of the variable at each spatial cell 
across the time period of the reference data set. The
phase shift is calculated as the difference, in days,
between the times at which the annual cycles of the
reference and model data sets reach their maxima.
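
A minimal Python sketch of the phase-shift calculation for one grid cell is
given below. It takes the month of maximum from each mean annual cycle,
converts the difference to days, and wraps it to half a year either way.
The sign convention (model minus reference), the equal month length of
365/12 days, the cosine score mapping, and the function names are
assumptions for illustration.

```python
import numpy as np

DAYS_PER_YEAR = 365.0


def phase_shift_days(ref_cycle, mod_cycle):
    """Timing difference of the annual maximum, model minus reference."""
    ref_cycle = np.asarray(ref_cycle, dtype=float)
    mod_cycle = np.asarray(mod_cycle, dtype=float)
    step = DAYS_PER_YEAR / len(ref_cycle)          # days per time step
    shift = (np.argmax(mod_cycle) - np.argmax(ref_cycle)) * step
    # Wrap so that a December peak is one month before a January peak.
    half = DAYS_PER_YEAR / 2.0
    return (shift + half) % DAYS_PER_YEAR - half


def phase_score(shift_days):
    """Score of 1 for no shift, falling to 0 for a half-year shift."""
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * shift_days / DAYS_PER_YEAR))


months = np.arange(12)
ref_cycle = np.cos(2.0 * np.pi * (months - 6) / 12.0)   # peaks at index 6
mod_cycle = np.cos(2.0 * np.pi * (months - 7) / 12.0)   # peaks one month later

shift = phase_shift_days(ref_cycle, mod_cycle)
score = phase_score(shift)
print(shift, score)
```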
The interannual variability in model simulations
is evaluated by removing the annual cycle
from both the reference data and the model. A
score is then computed as a function of their
differences over space.

The spatial distribution of any time-averaged 
variable is evaluated by computing the standard 
deviation of modeled values over space normalized 
by the standard deviation of the reference data. The 
spatial correlation is also calculated for the period 
mean values of reference data and modeled values. 
A score is assigned by the penalty for large devia- 
tion of the normalized standard deviations and the 
spatial correlation from a value of 1. 
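
The spatial distribution score can be sketched in Python as follows. The
functional form below equals 1 when both the normalized standard deviation
and the spatial correlation equal 1 and decreases as either departs from 1.
It is offered as one form with these properties, to be checked against
Collier et al. (2018); the function name and the data are hypothetical.

```python
import numpy as np


def spatial_distribution_score(ref_map, mod_map):
    """Score combining normalized standard deviation and spatial correlation."""
    ref = np.asarray(ref_map, dtype=float).ravel()
    mod = np.asarray(mod_map, dtype=float).ravel()
    sigma = mod.std() / ref.std()                  # normalized std deviation
    corr = np.corrcoef(ref, mod)[0, 1]             # spatial correlation
    return 2.0 * (1.0 + corr) / (sigma + 1.0 / sigma) ** 2


# Time-averaged values in six grid cells (illustrative only).
ref_map = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

perfect = spatial_distribution_score(ref_map, ref_map)
too_variable = spatial_distribution_score(ref_map, 2.0 * ref_map)
print(perfect, too_variable)
```

A model that reproduces the spatial pattern but doubles its amplitude is
perfectly correlated with the reference yet still loses about a third of
the score.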

The overall score for a given variable and data 
product is a weighted sum of the five metrics, pro- 
ducing a single scalar score for each variable for 
every model or model version. Readers who are 
interested in details of these metrics may study the 
paper by Collier et al. (2018). 
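
Combining the individual metric scores into one number per variable can be
sketched as a normalized weighted sum. The scores and weights below,
including the double weight on RMSE, are hypothetical values chosen for
illustration.

```python
def overall_score(scores, weights):
    """Weighted mean of metric scores (a weighted sum normalized by weights)."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(weights[name] * scores[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical metric scores for one variable of one model.
scores = {
    "bias": 0.80,
    "rmse": 0.60,
    "phase": 0.90,
    "interannual_variability": 0.70,
    "spatial_distribution": 0.75,
}
# Hypothetical weights; RMSE counts double in this sketch.
weights = {
    "bias": 1.0,
    "rmse": 2.0,
    "phase": 1.0,
    "interannual_variability": 1.0,
    "spatial_distribution": 1.0,
}

overall = overall_score(scores, weights)
print(round(overall, 3))
```

Normalizing by the sum of the weights keeps the overall score on the same
0 to 1 scale as the individual metric scores.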


PERFORMANCE OF THREE CLM VERSIONS AND 
FUTURE IMPROVEMENTS 


The metrics for bias, RMSE, seasonal cycle phase, 
spatial distribution, interannual variability, and 
variable-to-variable assessments were applied to 
evaluate three CLM versions (CLM4 vs. CLM4.5 vs. 
CLM5) under two forcing data sets (GSWP3v1 vs. 
CRUNCEPv7) (Lawrence et al., 2019). The quality 
of the simulations across model generations was 
found to be generally improving. CLM5 outperforms
CLM4 for the majority of assessed variables
(Figure 19.2). The improvements from CLM4.5 to
CLM5 were relatively subtle in that several variables
show improvement (e.g., biomass, burned area, LAI,
net ecosystem carbon balance, net ecosystem
exchange, and ecosystem respiration) but
others show degradation (e.g., soil carbon).

The functional relationships were also assessed 
between two variables (e.g., precipitation vs. GPP 
or LAI) (Figure 19.3). CLM5 performed bet- 
ter than CLM4 or CLM4.5 for the relationships 
between GPP and climate variables. However, the 
relationship between GPP and surface air tempera- 
ture slightly degraded from CLM4.5 to CLM5. 

The ILAMB benchmark analysis provides some insights into model development.
An improvement or degradation trend between two CLM versions can result
from a mix of scores for individual metrics. The degradation in the
simulations of soil carbon stocks from CLM4.5 to CLM5 is partially due to
high uncertainty in the observational estimates. Another metric evaluates
the models against apparent soil carbon turnover time, showing an
improvement from CLM4.5 to CLM5. The disagreement between the two metrics
of soil carbon may suggest the need for future improvement of
observationally constrained estimates.

Figure 19.2. Evaluation of performance of CLM4, CLM4.5, and CLM5 under two sets of forcing, GSWP3v1 and CRUNCEPv7.
A stoplight color scheme is used to indicate aggregate performance for each model by variable. (Adopted from Lawrence et al.,
2019).

Figure 19.3. Variable-to-variable comparison between annual precipitation and LAI for CLM4, CLM4.5, and CLM5 under the
GSWP3v1 forcing. The black line is the observationally derived relationship. Error bars indicate the ±1 standard deviation of LAI
for all grid cells that lie within that precipitation bin. Values in parentheses are the scores for that comparison. (Adopted from
Lawrence et al., 2019).

Model performance depends on three elements: 
model structure, parameterization, and forcing 
(see Chapters 21 and 33). The model structure 
that simulates soil carbon dynamics in CLM is pri- 
marily based on first-order kinetics. Although this 
model structure has been questioned, almost all 
data sets from studies of litter decomposition and 
soil incubation suggest the structure may be highly 
reliable (Chapter 1). Model parameterization is 
likely the main cause of the model-data mismatch. 
Chapter 37 discusses methods to improve model 
parameterizations of CLM5 to improve model 
performance. 


CONCLUSIONS 


A four-component benchmark analysis was out- 
lined: (1) identification of aspects of models to be 
evaluated; (2) selection of benchmarks as standard- 
ized references to evaluate models; (3) a scoring 
system to measure model performance skills; and 
(4) evaluation of model performance to inform 
model improvement. The International Land Model 
Benchmarking (ILAMB) project has developed an 
open-source model benchmarking software package to score model
performance. ILAMB has developed a suite of reference data sets as
benchmarks, and five metrics plus variable-to-variable relationships
as the scoring system to evaluate models or model versions. The ILAMB
package has been applied to perform comprehensive model assessment across
a wide range of land variables. Such benchmark analysis offers insights
into the strengths and weaknesses of different models or model versions
and helps identify future improvements.


QUIZZES


1. What are the similarities and differences between 
model validation and benchmark analysis? 


2. How does benchmark analysis evaluate model 
performance? 


3. What variables in carbon cycle models would 
you choose to be evaluated by a benchmark 
analysis? 

4. What data sets do you think would be important 
to be used as benchmarks to evaluate models? 


5. What five metrics does the ILAMB package use 
to score model performance? 
TERRESTRIAL CARBON CYCLE MODELS


CONTENT


Introduction


The practice is designed to help you learn trace- 
ability analysis to identify sources of model uncer- 
tainty in predicting terrestrial carbon (C) storage. 
All practices are performed in the training software 
CarboTrain. With this tool, you will apply trace- 
ability analysis to simulation results from a matrix 
form model (called authentic traceability analy- 
sis) and to model intercomparison projects (MIP) 
without matrix models (i.e., post-MIP traceability 
analysis). The authentic traceability analysis will 
show you how simulation results from a matrix 
model are explained by traceable components over 
space and among biomes. The post-MIP traceabil- 
ity analysis can help you understand the sources of 
uncertainty among different models. 


INTRODUCTION 


Traceability analysis provides an approach to divide the simulated land
carbon dynamics into several traceable components, such as carbon (C)
storage capacity, gross primary productivity (GPP), C residence time, and
environmental scalars. This practice offers two exercises. The first uses
authentic traceability analysis, in which simulation results from the
CABLE matrix model are explained by traceable components hierarchically
over space and among biomes. The authentic traceability analysis for the
first exercise is described in detail in Chapter 17, based on the study by
Xia et al. (2013). The second exercise uses post-MIP traceability analysis
to attribute variations in modeled land C storage among three CMIP6 models
(i.e., CESM2, CNRM-ESM2-1, and IPSL-CM6A-LR) over 1980-2000 to different
sources. The post-MIP traceability analysis is described in detail in
Chapter 18. More information is available in papers by Zhou et al. (2018,
2021). The authentic traceability analysis can pinpoint model uncertainty
to individual processes and/or specific parameter values but requires
matrix models. In comparison, the post-MIP traceability analysis can be
applied to any modeling results to understand sources of uncertainty but
may not be able to trace uncertainty to specific processes and/or
parameters.

The exercises are performed in the training software
CarboTrain by selecting Unit 5 and Exercises
1-2. Click Run Exercise to generate the figures.
EXERCISE 1: Authentic traceability analysis


This exercise helps you learn to do traceability analysis with a matrix
model, which derives from the Community Atmosphere-Biosphere-Land Exchange
(CABLE) model. CABLE is one of the most widely used global land models for
simulating terrestrial biogeochemical and biophysical processes. The C
cycle diagram of the CABLE model is shown in Chapter 14, Figure 14.1. CABLE
has nine carbon pools, including the plant pools (leaf, root and wood), the
litter pools (metabolic and structural litter as well as coarse woody
debris) and three soil pools (microbial biomass, slow and passive soil
organic matter). The matrix form of the CABLE model we will use here was
derived by Xia et al. (2013). In this exercise, the spatial resolution is
1° × 1°. The land grid cells are categorized into nine biomes: Evergreen
Needleleaf Forest (ENF), Evergreen Broadleaf Forest (EBF), Deciduous
Needleleaf Forest (DNF), Deciduous Broadleaf Forest (DBF), Shrubland,
C3 Grassland (C3G), C4 Grassland (C4G), Tundra, and Barren/sparse
vegetation (Barren).

Here you can use the CarboTrain software to do this exercise. In the main
window of CarboTrain (Figure 20.1), choose unit 5 and exercise 1, and then
set the output path for the results by clicking the “Set Output Folder”
button. After completing these steps, click “Run Exercise” to start the
exercise.

Figure 20.1. Steps to run Exercise 1.

A pop-up window with the message “Task submitted!” will appear after you
click the “Run Exercise” button, as shown in Figure 20.2. After you click
“OK”, the task will run automatically on your computer.

Figure 20.2. Tips at the beginning and end of the task.

The command window shows the progress of the task, so you can see which
step of the program is complete, as shown in Figure 20.3. When the task is
finished, another pop-up window will appear with the message “Finished!”,
as shown in Figure 20.2. The software will write the results to the output
path you set earlier.

Figure 20.3. Prompt in CMD interface when the program is running.

Figure 20.4. Output results after the task is complete.


Figure 20.4 shows an example of the output after the task is completed.
Open the output path you set earlier, and you will see a directory named
unit5_exercise1. Enter this directory to find the files shown in
Figure 20.4.

There are four data files in the folder unit5_exercise1/dataSource. The
simulated global distribution of total ecosystem C storage capacity by the
CABLE model is recorded in data_global_ctot.csv. The ‘nan’ values represent
ocean grid cells or non-vegetated land. The global map of the total
ecosystem C storage capacity is shown in Figure 20.5a.

The file data_npp_resTime.xlsx provides the simulated net primary
productivity (NPP) and ecosystem C residence time in each land grid cell.
Figure 20.5b shows how NPP and ecosystem C residence time together
determine the spatial difference in ecosystem C storage capacity in the
CABLE model. For example, ENF has an intermediate NPP (0.39 kg C m⁻² yr⁻¹)
and a relatively long C residence time (86.4 years), leading to the largest
total ecosystem carbon storage capacity (34.1 kg C m⁻²) among the nine
biomes. By contrast, Tundra has a small ecosystem C storage capacity
(8.7 kg C m⁻²) due to its low NPP (0.1 kg C m⁻² yr⁻¹), though its C
residence time is long (141.2 years).

The file residence_components.xlsx contains simulated results of total
ecosystem C residence time, baseline residence time, and environmental
scalar in each land grid cell. Figure 20.5c shows how baseline C residence
time and environmental scalar jointly determine the global distribution of
ecosystem C residence time in the CABLE model. It is clear that the order
of ecosystem C residence time among biomes is different from that of
baseline C residence time. In CABLE, ecosystem C residence time changes
with the biome types as DNF (163.3 years) > Tundra (141.2 years) > ENF
(86.4 years) > Shrubland (52.6 years) > DBF (33.3 years) > C3G (26.6
years) > EBF (26.3 years) > Barren (20.4 years) > C4G (17.5 years).

The environmentalScalars.xlsx file provides the mean annual precipitation,
mean annual temperature, water scalar, and temperature scalar in each land
grid cell. As shown in Figure 20.5d-e, the environmental scalars link the
climate forcings directly to terrestrial C cycle processes in the CABLE
model. Figure 20.5e shows that the temperature scalar varies systematically
among biomes in the CABLE model, whereas the mean water scalar is
distributed in a narrow range from 0.65 in EBF to 0.87 in DNF.

Figure 20.5. Simulated spatial distribution of terrestrial carbon storage capacity and its traceable components by the
CABLE model. (a) Spatial distribution of total ecosystem C storage capacity; (b) determination of the ecosystem carbon
storage capacity by NPP and ecosystem residence time in various biomes; (c) dependence of ecosystem C residence time
on baseline C residence time and the environmental scalar in various biomes; (d) global distribution of major biomes
in relation to annual temperature and precipitation; (e) global distribution of major biomes in relation to water and
temperature scalars.

The figures of this exercise can be found in unit5_exercise1/output_figs.
More details about the CABLE model and simulations in this exercise are
provided in Chapter 17.


QUESTIONS:

1. Which environmental factor, water or temperature, contributes more to
the difference in ecosystem C residence time among biomes in the CABLE
model? Why?

2. Which biome has the longest baseline C residence time? Why?

3. Would you expect the incorporation of nitrogen cycling to influence the
simulated ecosystem C storage compared to the C-only model? How can this
question be explored by applying the authentic traceability analysis with
the CABLE model?
EXERCISE 2: Post-MIP traceability analysis


In this exercise we will practice performing uncertainty analysis of
carbon cycle modeling after simulations have been done, without the models
being converted to matrix equations (Chapter 18). Because CMIP6 models
provide outputs of transient land C storage rather than long-term
steady-state carbon storage (i.e., carbon storage capacity), we compare
the simulation results of the three CMIP6 models over 1980-2000. Here you
also use the CarboTrain software to do this exercise. Launch CarboTrain
and select unit 5 and exercise 2. Open the Config 5 tab, where you can
customize the spatial and temporal ranges of the analysis by entering
latitude, longitude, and temporal ranges. Please note that the values
entered here should be within the allowable ranges (-90 to 90 for
latitude, 0 to 360 for longitude, 1980 to 2000 for the temporal range).
Set the output path by clicking Set Output Folder. Finally, click Run
Exercise to start the traceability analysis of the three CMIP6 models.
(Note that the running time of the task depends on the selected ranges and
the configuration of the computer.)
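
If you prefer to check your intended ranges before typing them into the
Config 5 tab, the allowable ranges above can be expressed as a small Python
helper. This is a hypothetical stand-alone sketch and not part of
CarboTrain; the function names and the requirement that each start value
does not exceed its end value are assumptions.

```python
LAT_RANGE = (-90.0, 90.0)      # degrees north
LON_RANGE = (0.0, 360.0)       # degrees east
YEAR_RANGE = (1980, 2000)      # simulation years available in the exercise


def check_range(name, start, end, allowed):
    """Raise ValueError unless allowed[0] <= start <= end <= allowed[1]."""
    low, high = allowed
    if not (low <= start <= end <= high):
        raise ValueError(
            f"{name} range {start} to {end} must lie within {low} to {high}"
        )


def validate_selection(lat, lon, years):
    """Check (start, end) pairs for latitude, longitude, and years."""
    check_range("latitude", lat[0], lat[1], LAT_RANGE)
    check_range("longitude", lon[0], lon[1], LON_RANGE)
    check_range("year", years[0], years[1], YEAR_RANGE)
    return True


# A tropical band over the full longitude range for 1985 to 1995.
print(validate_selection((-23.5, 23.5), (0.0, 360.0), (1985, 1995)))
```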

The command window shows the running 
processes of the task as shown in Figure 20.6. 
When the task is finished, results will appear in 
the output path you set before (Figure 20.7). 

Figure 20.6. Prompt in CMD interface when the program is running.

Figure 20.7. Output results after the task is done.

Here are descriptions of the files and the workflow of the model
evaluation system in CarboTrain. All files can be found in or under the
unit5_exercise2 folder in the output directory you specified earlier. The
package consists of a file named Preset.txt, three Python scripts
(Main_traceme.py, AnnualTAT.py, RegionTAT.py), and three subfolders (data,
results, R_docs). Preset.txt contains a specification of the CMIP6 model
outputs stored in the data folder. Traceability analysis requires the
outputs of each model to include total C storage (vegetation C, soil C
and/or litter C), GPP, NPP, temperature, and precipitation. After the task
is submitted, the Main_traceme.py script reads the information in
Preset.txt to preprocess the data. The preprocessed data are passed to the
AnnualTAT.py and RegionTAT.py scripts to perform temporal and spatial
traceability analysis, respectively. The R_doc folder contains the R
language script used to calculate the variance contribution of different
components to C storage in the temporal traceability analysis. The results
of the traceability analysis are written to the results folder.

We will now step through using the traceability
analysis to analyze the differences in land
C storage among three CMIP6 models (i.e.,
CESM2, CNRM-ESM2-1, and IPSL-CM6A-LR).
In the results folder, you can find the results
of the temporal traceability analysis on the three
CMIP6 models over 1980-2000. First, you can
find differences in the time series of simulated
C storage among the models
(temporal-1-Carbon-Dynamic.png; Figure 20.8a).
It is clear that the IPSL-CM6A-LR model has the
lowest land C storage among the three models
for the simulation period. The traceability
analysis decomposes the C storage into C storage
capacity and potential (Figure 20.8a). The
result shows that the lowest land C storage in
IPSL-CM6A-LR is due to the lowest C storage
capacity rather than the C storage potential.
Then, ecosystem C storage capacity is further
decomposed into NPP and C residence time
(temporal-2-NPP-ResidenceTime.png;
Figure 20.8b). We can see that the lowest
ecosystem C storage capacity in IPSL-CM6A-LR
is driven by the shortest ecosystem C residence
time among the three models. Furthermore,
NPP is decomposed into GPP and C use efficiency
(CUE) (temporal-3-GPP-CUE.png;
Figure 20.8c). Ecosystem C residence time
can be decomposed into baseline C residence
time and environmental scalars
(temporal-4-Envs-baselineResidenceTime.png;
Figure 20.8d). The simulated environmental
scalars have not been provided for each
model, so we cannot here further evaluate how
climate forcings influence the baseline C residence
time. However, among the three earth
system models in this exercise, it is clear that
the model parameterizations of C allocation, C
age, and C transfers among pools are the most
important contributors to the large differences
in global land C dynamics in Figure 20.8a.

Figure 20.8. Results of the temporal traceability analysis on three CMIP6 models from 1980 to 2000. (a) Land C storage
decomposed into C storage capacity and potential (solid lines: C storage; shaded outlines: C storage capacity; shades:
C storage potential (positive above and negative below the solid lines)); (b) C storage capacity decomposed into NPP
and residence time; (c) NPP determined by GPP and CUE; (d) residence time decomposed into baseline residence time
and environmental scalar.
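
The chain of decompositions in this exercise can be written as a few lines of arithmetic. The sketch below is illustrative only. It assumes the usual traceability relations (C storage = capacity minus potential; capacity = NPP times residence time; NPP = GPP times CUE; residence time = baseline residence time divided by the environmental scalar). The function name and all numbers are hypothetical; they are not values from the three CMIP6 models and not code from CarboTrain.

```python
def trace_storage(gpp, cue, baseline_tau, env_scalar, potential):
    """Decompose land C storage into traceable components.

    gpp           gross primary production (Pg C per year)
    cue           carbon use efficiency, NPP / GPP (unitless)
    baseline_tau  baseline C residence time (years)
    env_scalar    environmental scalar in (0, 1]; assumed to divide baseline_tau
    potential     C storage potential (Pg C); positive when storage is below capacity
    """
    npp = gpp * cue                  # NPP = GPP x CUE
    tau = baseline_tau / env_scalar  # residence time under actual climate
    capacity = npp * tau             # C storage capacity
    storage = capacity - potential   # transient C storage
    return {"npp": npp, "tau": tau, "capacity": capacity, "storage": storage}


# Two hypothetical models that differ only in baseline residence time.
model_a = trace_storage(gpp=120.0, cue=0.5, baseline_tau=10.0,
                        env_scalar=0.5, potential=100.0)
model_b = trace_storage(gpp=120.0, cue=0.5, baseline_tau=6.0,
                        env_scalar=0.5, potential=100.0)

for name, result in (("model_a", model_a), ("model_b", model_b)):
    print(name, result)
```

With identical GPP, CUE, and potential, the model with the shorter baseline residence time ends up with the lower storage, which is the same kind of attribution the exercise makes for IPSL-CM6A-LR.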

In the results/nc-files folder, you
can access the NetCDF-format data of the
temporal and spatial traceability analysis for
each model. In the results/figures
folder, you can find the outputs that show the
results of the spatial traceability analysis on
the three models over 1980-2000. Each figure
includes the global distribution of a variable
simulated by each model and the standard
deviation of that variable. First, you can get the
global distribution of the simulated C storage
by each model and the global distribution of the
standard deviation of simulated ecosystem C
storage among the models. The global distributions
of all traceable components in Figure 20.8
are mapped. Figure 20.9 shows that the lowest
baseline C residence time in IPSL-CM6A-LR is
widely distributed in the Northern Hemisphere,
except for the Tibetan Plateau. This analysis is
helpful for informing developers of the
IPSL-CM6A-LR model on how they may further
improve their parameterization of ecosystem
baseline C residence time.

Figure 20.9. Global distribution of baseline carbon residence time in three CMIP6 earth system models over 1980-2000.


QUESTIONS:


1. Among the three earth system models examined
in this exercise, why does the IPSL-CM6A-LR
model simulate the lowest land
C storage? Which traceable component contributes
most to the low values for C storage?

2. Based on Figure 20.9, can you figure
out which region has the largest variation in
baseline C residence time among the three
models? Based on the output of the post-MIP
traceability analysis, can you generate
a global map of the standard deviation of
baseline C residence time among the three
models?

3. If we want to apply the authentic traceability
analysis to a model-intercomparison project
(e.g., CMIP6), what additional information
or modeling outputs are needed from each
model?
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Realistic prediction of ecosystem responses to cli- 
mate change requires not only a perfect model 
structure to represent the real-world processes but 
also parameterization to constrain model speci- 
fications and external forcing variables to reflect 
the environment that an ecosystem experiences 
to perform its functioning. Data assimilation is a 
statistical approach to model parameterization. 
This chapter introduces data assimilation, mainly 
focusing on concepts, procedure, and applications. 
We relate data assimilation to regression analysis to 
show a seven-step procedure: defining a research 
objective, having data, using one model, mea- 
suring data-model mismatches, minimizing the 
mismatches via global optimization, estimating 
parameters, and predicting ecosystem changes. 


INTRODUCTION OF DATA ASSIMILATION 


Data assimilation is a statistically rigorous approach
to model parameterization, which is one of the
three essential elements of realistic model prediction.
To realistically predict ecosystem responses
to climate change, we need a model structure
that represents the real-world processes that control
system functions. We also need model specification
through parameterization to constrain model predictions
(Figure 21.1). Moreover, the external forcing
variables have to reflect the physical, chemical,
and biological environment that an ecosystem
experiences. Ideally, all three elements have to be
perfectly aligned before a model can predict
ecosystem responses well.

In reality, we spend much more time develop- 
ing process-based models than thinking about 
parameterization or forcing. When prediction of a 
model does not match well with observations, we 
usually look at model structure and often ignore 
parameterization and forcing. In fact, param- 
eterization cannot be totally ignored. To make a 
model work, we have to tune parameters. But we 
have not learned much from parameter tuning 
in the past several decades. Data assimilation is a 
relatively new approach that can be used to rigor- 
ously estimate parameter values and offers a new 
way to learn about model parameterization. We 
will learn about data assimilation in the training 
units 6-10. 

External forcing will be explored in unit 9. In
short, we need a data-model consistent
system, which can be realized by an interactive
Ecological Platform for Assimilating Data (EcoPAD)
or similar workflow systems, with realistic
external forcing variables to drive models for
ecological forecasting.

Figure 21.1. Three elements required for realistic model prediction of ecosystem responses to global change. Parameterization
is as important as model structure and forcing in predicting states of ecosystems under global change. Data assimilation
offers a statistically rigorous approach not only to fit a model better to data, but also to evaluate which, how, how
much, and why parameters change. The workflow system Ecological Platform for Assimilating Data (EcoPAD), which links real-time
forcing variables for automating forecasting and predictability of the terrestrial carbon cycle, is discussed in Chapter 33.


THE NEED FOR DATA ASSIMILATION 


To fully understand the need for data assimilation,
we should first understand process-based modeling.
Traditional modeling is often called simulation
modeling or forward modeling. Simulation
modeling usually starts with developing a process-based
model, that is, a model structure with a series of
equations that represent processes in a system,
together with assigned parameter values. Then, we
use forcing variables to drive the
model. The forcing variables for an ecological
model usually include radiation, temperature, and
precipitation. When the forcing variables are used
to drive the model, the model generates outputs,
such as carbon storage or net ecosystem exchange.
The model outputs are sometimes compared with
data for validation. This simulation approach has
been extensively used by ecologists for about 60
years. It is very powerful for exploring ideas and
hypotheses in ecology.


Here is an example of a simulation modeling
study done at the Duke Forest Free Air CO2 Enrichment
(FACE) project. The project had three FACE
rings and three control rings plus one prototype
ring. In the FACE rings, the CO2 concentration was
200 parts per million higher than that in the three
ambient rings. One of the ideas for the FACE study
was to show how a forest would behave when CO2
concentration increases to the elevated level in the
middle of this century. The hypothesis was that forest
carbon sequestration measured in the elevated CO2 rings
would be an estimate of sequestration when the
atmospheric CO2 concentration has gradually increased
to that level.

Dr. James Reynolds and I worked together to
test this hypothesis using a simulation modeling
approach with an ecosystem model (Figure 21.2).
We simulated two scenarios: CO2 concentration
either gradually increasing, as shown in panel A, or
abruptly increased, as in panel D of Figure 21.2.
When the atmospheric CO2 concentration is gradually
increasing, both photosynthesis and ecosystem
respiration gradually increase as shown
in panel B. But respiration lags behind because the
carbon that enters the ecosystem through photosynthesis
has to stay in the ecosystem for a
while before it is released via respiration. The
difference between photosynthesis and respiration is
the ecosystem carbon sequestration as shown in
panel C. The modeled carbon sequestration gradually
increases under the gradual CO2 increase scenario,
which represents the real world. In contrast, photosynthesis
abruptly increases in response to a step increase
in CO2 concentration as in the FACE experiment,
but ecosystem respiration gradually increases,
leading to a pulse increase in carbon sequestration
followed by a gradual decline as shown in panel
F. This simulation modeling study is quite
powerful in contradicting the original hypothesis that
carbon sequestration observed in FACE plots can
be extrapolated to the real world, where CO2
concentration is gradually increasing.

Figure 21.2. Simulation model to evaluate responses of ecosystem carbon (C) processes to a gradual vs. step increase in
atmospheric [CO2]. In response to the gradual increase in [CO2] (A), both photosynthesis and respiration increased gradually (B),
leading to a gradual increase in carbon sequestration (C). In response to the step CO2 increase (D), photosynthesis immediately
increases, but respiration slowly increases (E), leading to a high rate of carbon sequestration right after the step increase in CO2,
followed by decline.
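
The lag between photosynthesis and respiration can be reproduced with a very small forward model. The sketch below is not the ecosystem model used in the study. It assumes a single carbon pool with respiration proportional to pool size and photosynthesis that rises linearly with CO2; the function name and every parameter value are hypothetical.

```python
def simulate(co2_series, p0=10.0, beta=0.02, tau=20.0, co2_ref=350.0):
    """Forward-simulate one carbon pool driven by a yearly CO2 series.

    Photosynthesis: P = p0 * (1 + beta * (CO2 - co2_ref) / 10)  (assumed linear)
    Respiration:    R = pool / tau                              (first-order release)
    Sequestration:  P - R
    """
    pool = p0 * tau  # steady state at CO2 = co2_ref, so P equals R initially
    sequestration = []
    for co2 in co2_series:
        photo = p0 * (1.0 + beta * (co2 - co2_ref) / 10.0)
        resp = pool / tau
        net = photo - resp
        sequestration.append(net)
        pool += net  # one-year explicit Euler step
    return sequestration


years = 50
# Scenario 1: CO2 ramps up by 200 ppm over the whole period.
gradual = [350.0 + 200.0 * t / (years - 1) for t in range(years)]
# Scenario 2: CO2 jumps by 200 ppm after the first year (FACE-like step).
step = [350.0] + [550.0] * (years - 1)

seq_gradual = simulate(gradual)
seq_step = simulate(step)

print("gradual: first, last =", seq_gradual[0], seq_gradual[-1])
print("step:    peak year =", seq_step.index(max(seq_step)), "last =", seq_step[-1])
```

In this toy setting the step scenario produces a pulse of sequestration that decays as the pool (and hence respiration) catches up, whereas the gradual scenario produces a slowly rising sequestration rate.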

However, when simulation models are used
for prediction or forecasting, the model and data
usually do not match well. For example, predictions
of soil carbon density by eleven models used in the
Coupled Model Intercomparison Project Phase 5 (CMIP5)
do not match the observation-based carbon
density from the Harmonized World Soil
Database (HWSD) (Luo et al. 2015). None of the
model predictions match observations well.
The mismatches between the models and observations
are partly due to the lack of data constraints
on either model structures or parameterization.

Data assimilation is also needed to integrate
information contained in both model and data.
Modeling is one approach to scientific inquiry,
mainly through process thinking. Data are obtained
from field or laboratory research, which is also an
approach to scientific inquiry, through snapshot
records of ecosystem states at the time when the
observation was made. The two approaches acquire
information on different aspects of an ecosystem.
The information acquired from recording the state
of the ecosystem is highly complementary to that
of process understanding. Integration of model and
data will help gain the best knowledge from imperfect
data and imperfect models (Luo et al. 2011).

When data assimilation is used for integration
of data and models, it starts with data and associated
uncertainty (or noise). Data are combined with
modeled values to inversely infer model structures
and/or parameter values. Therefore, data
assimilation is sometimes called inverse analysis,
data-model fusion, inverse modeling, multiple
constraints, or inference analysis, depending on
the situation in which the technique is applied.


SEVEN-STEP PROCEDURE OF DATA 
ASSIMILATION 


A data assimilation study is usually conducted by 
following a seven-step procedure. I will explain the 
seven steps mainly using the example in the study 
by Xu et al. (2006). You may go back and forth 
between this chapter and the paper by Xu et al. 
(2006). This seven-step procedure is explained in 
the practice of unit 6 and other chapters. 

Data assimilation is fundamentally a statistical 
analysis. It is very similar to regression analysis in 
terms of the procedure. 


[Figure 21.3, regression analysis panel: scatter plot against arm length (cm) with the fitted line Y = 25.8 + 3.69X, r = 0.94.
Steps listed in the panel: (1) use arm length (X) to predict height (Y) (objective); (2) collect data of X and Y of all
people in a group; (3) select an equation Y = a + bX; (4) measure mismatch (y - ŷ), where ŷ is an estimate of y;
(5) minimize Σ(y - ŷ)² to optimize estimation of a and b; (6) obtain the optimized values of parameters a and b;
(7) use the optimized equation to do prediction.]

Figure 21.3. A seven-step procedure for both regression analysis and data assimilation. Data assimilation is a statistical approach
and has a procedure similar to that of regression analysis. A key measure is the fit between data and model values in regression and
the posterior probability distributions of estimated parameters in data assimilation, although the data-model fit is also
important for data assimilation.

When we conduct a regression analysis to
understand a relationship, for example, between
arm length and height, we usually go
through seven steps. First, we need to decide what
the objective is. In this case, we try to predict height
from arm length. Second, we need to collect data
on the arm length and height of all individuals in a
sample of the population we are going to investigate.
Third, we need to select a regression equation. In
this case, a linear model is adequate. Fourth, we
need to measure mismatches between observed
heights, y, and estimated heights, ŷ, from the
regression line. Fifth, we minimize the mismatch
using the method of least squares. Sixth, we
obtain the optimized estimates of the regression
coefficients a and b of the linear model. The seventh
step is to predict a height from the arm length
of one individual. For example, the regression line
between heights and arm lengths from a sample in
Nigeria reported by Vanderjagt et al.
(2001) is y = 25.8 + 3.69x, where x is the
arm length and y is the height (Figure 21.3).
In short, the seven steps of conducting regression
analysis are: (1) setting up an objective; (2) collecting
data; (3) selecting an equation; (4) measuring
mismatches; (5) optimization by minimizing the
sum of squares of the residuals; (6) estimation of
parameters; and (7) prediction (Figure 21.3).
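
The seven regression steps map directly onto a few lines of code. The sketch below uses only the Python standard library and synthetic, hypothetical arm-length and height data, generated around the published line purely for illustration; these are not the data of Vanderjagt et al. (2001). The function names are assumptions.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_sq_mismatch(a, b, x, y):
    """Steps 4-5: the sum of squared residuals that least squares minimizes."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Steps 1-2: objective (predict height from arm length) and synthetic data.
rng = random.Random(0)
arm = [rng.uniform(30.0, 42.0) for _ in range(200)]
height = [25.8 + 3.69 * xi + rng.gauss(0.0, 3.0) for xi in arm]

# Steps 3-6: choose y = a + b*x and estimate a and b.
a_hat, b_hat = fit_line(arm, height)

# Step 7: predict the height of one individual with a 36 cm arm.
predicted = a_hat + b_hat * 36.0
print("a =", a_hat, "b =", b_hat, "predicted height =", predicted)
```

Data assimilation follows the same sequence, but replaces the straight line with a process-based model and the closed-form least-squares solution with a sampling-based search.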

The procedure to conduct a data assimilation 
study is similar to a regression analysis. To conduct 
a data assimilation study, we also need to: first, 
set up an objective, such as predicting ecosystem 
response to global change; second, have data sets; 
third, have a model; fourth, develop a cost func- 
tion to measure mismatches between observations 
and model values; fifth, have global optimization 
to minimize the cost function; sixth, estimate 
parameter values; and seventh, do prediction. 

We use a study done by Xu et al. (2006) to
show the seven steps. First, the objective of the
study was to predict terrestrial carbon sequestration
in response to elevated CO2 concentration.
Second, the study used six data sets at the ambient
CO2 treatment and six data sets at the elevated
CO2 treatment. Third, the study used the Terrestrial
ECOsystem (TECO) model. Fourth, mismatches
between observed and modeled values were measured
by a cost function. Fifth, the optimization
method used in the study was Markov Chain
Monte Carlo with the Metropolis-Hastings algorithm.
Sixth, the estimated parameters were decomposition
coefficients of litter and soil organic carbon.
Seventh, prediction was on carbon sequestration in
nine pools of the TECO model at both the ambient
and elevated CO2 treatments.

Although the procedure is very similar to
regression analysis, there are some new concepts,
such as mapping functions, a cost function, Markov
Chain Monte Carlo (MCMC) sampling series, and
the Metropolis-Hastings criterion. Those new concepts
are explained in Chapter 22 and illustrated
with examples in the practice session of unit 6.

Let us go over each of the seven steps 
(Figure 21.3). Note that cited symbols, page num- 
bers, figures, and tables in this section all refer to 
those in the paper by Xu et al. (2006). 


1. Step 1 is to define an objective. The general
objective of the study was to predict ecosystem
responses to elevated CO2 concentration
with uncertainty quantified. You can find this
objective in paragraph 2 on page 1. A more
specific objective of the study was to estimate
parameter values, c1...c7, in Table 1 on
page 3 of the paper by Xu et al. (2006) and
then predict carbon pool changes at ambient
and elevated CO2 treatments in Duke
Forest as described in paragraph 4 on page 2.
Parameters c1...c7 represent decomposition
of organic carbon from litter and soil pools.


2. Step 2 is to have data sets. The data sets used
in the study are soil respiration, woody biomass,
foliage biomass, litterfall, soil carbon,
and mineral carbon at the ambient and
elevated CO2 treatments. Those data sets are
described in paragraph 6 on page 3 and
Table 2 on page 4 of the paper by Xu et al.
(2006). We also need to use standard deviations
of the six data sets in data assimilation.


3. Step 3 is to have a model. This study uses
the Terrestrial ECOsystem (TECO) model.
The TECO model is depicted in Figure 1 
and described in equation 1 on page 2 of 
the paper by Xu et al. (2006). Please note 
that that figure has a typo. Pool X, should 
be a non-woody pool, which includes foli- 
age and fine root biomass. One more point 
is that the TECO model in this paper was 
described by the matrix equation, which 
is the same equation as we studied in units 
1-5 with slightly different notations. 


4. Step 4 is to define a cost function. The cost
function measures mismatches between
observed and modeled values. The mismatches
are described by equation 5 for
individual data sets and equation 6 for all
six data sets on page 4 of the paper by Xu
et al. (2006). The modeled values have to be
mapped to observations via a mapping function
Φ, which has six elements, one
for each of the six data sets, as described in equation 3
on page 4. The mapping function is further
illustrated in the practice session in unit 6
of this book. The cost function is equivalent
to the likelihood function in equation 8 on
page 4 of the paper by Xu et al. (2006).


5. Step 5 is to minimize the cost function
via global optimization. The optimization
was done with Markov Chain Monte Carlo
(MCMC) sampling series. The MCMC has
two phases, a proposing phase and a moving
phase. The proposing phase generates
new parameter sets. The moving
phase examines whether the newly proposed
parameters should be accepted or not using
a Metropolis-Hastings criterion. The two
phases of MCMC are described in paragraphs
12 and 13 on pages 4 and 5 of the
paper by Xu et al. (2006). The study uses
five parallel runs to test convergence of sampling
series with the Gelman-Rubin method
as described in paragraphs 14 and 15 on
page 5 of the paper by Xu et al. (2006).


6. Step 6 is to estimate parameter values after
the cost function is minimized in step 5. The
estimated parameter values are described in 
paragraph 17 on page 5, depicted in Figures 
3 and 4, and Table 3 on pages 6-8 of the 
paper by Xu et al. (2006). Estimated param- 
eters have three types of distributions, or 
histograms, as shown in the mid-columns 
in Figures 3 and 4. The three types of histo- 
grams are bell-shaped for parameters c,, c, 
and c,, edge-hitting for c,, and flat for cz, cs 
and c, in Figure 4 on page 7 of the paper by 
Xu et al. (2006). The bell-shaped histograms
indicate that those parameters are well constrained
by data. The flat histograms
mean that those parameters are not constrained
by data. In other words, the six data
sets do not contain information that is relevant
to those parameters. The edge-hitting
histograms mean that those parameters may
be strongly correlated with other parameters
or that the pattern is caused by other reasons.


7. Step 7 is to use the estimated parameters
from step 6 in the TECO model to predict
future states of ecosystems. The predicted
carbon pool changes in the year 2010
at the ambient and elevated CO2 treatments
are described in paragraph 17 on
page 5, Figure 9 on page 11, and Table 4 on
page 12 of the paper by Xu et al. (2006).
Uncertainty in model prediction is quantified
with cumulative density functions in
Figure 9 and confidence intervals in Table 4.
CO2 effects are represented by the predicted
changes in pool sizes as indicated by the
differences between the solid lines at elevated
CO2 and dashed lines at ambient CO2 in
Figure 9. As you can see from the paper by
Xu et al. (2006), elevated CO2 had the largest
effects on woody biomass but no effects
on the passive soil carbon pool.
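
Steps 4 to 6 can be sketched with a toy problem. The example below is not the TECO inversion of Xu et al. (2006). It assumes a hypothetical one-pool model X(t) = X0 exp(-c t) with a single decay parameter c, synthetic observations, a uniform prior on c, and a symmetric random-walk proposal, so that the Metropolis-Hastings criterion reduces to the simpler Metropolis ratio. All names and values are illustrative.

```python
import math
import random


def model(c, times, x0=100.0):
    """Hypothetical one-pool decay model."""
    return [x0 * math.exp(-c * t) for t in times]


def cost(c, times, obs, sigma):
    """Step 4: squared data-model mismatch weighted by the data variance."""
    modeled = model(c, times)
    return sum((o - m) ** 2 for o, m in zip(obs, modeled)) / (2.0 * sigma ** 2)


def metropolis(times, obs, sigma, c_min=0.0, c_max=1.0,
               n_iter=5000, step=0.005, seed=1):
    """Step 5: random-walk Metropolis sampling of c within a uniform prior."""
    rng = random.Random(seed)
    c_old = rng.uniform(c_min, c_max)
    j_old = cost(c_old, times, obs, sigma)
    chain = []
    for _ in range(n_iter):
        # Proposing phase: perturb the current value.
        c_new = c_old + rng.uniform(-step, step)
        if c_min <= c_new <= c_max:  # proposals outside the prior are rejected
            j_new = cost(c_new, times, obs, sigma)
            # Moving phase: accept with probability min(1, exp(j_old - j_new)).
            if j_new <= j_old or rng.random() < math.exp(j_old - j_new):
                c_old, j_old = c_new, j_new
        chain.append(c_old)
    return chain


# Synthetic observations from a known decay rate (for illustration only).
data_rng = random.Random(0)
times = list(range(0, 30, 2))
true_c, sigma = 0.1, 2.0
obs = [m + data_rng.gauss(0.0, sigma) for m in model(true_c, times)]

chain = metropolis(times, obs, sigma)
posterior = chain[len(chain) // 2:]      # discard the first half as burn-in
c_est = sum(posterior) / len(posterior)  # step 6: posterior mean of c
print("estimated c:", c_est)
```

A histogram of `posterior` plays the role of the parameter histograms discussed in step 6: a narrow, bell-shaped histogram would indicate that the synthetic data constrain c well.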


Again, the seven steps to conduct a data assimilation
study are setting up an objective, collecting
data sets, having a model, defining a cost
function, using a global optimization method to
minimize the cost function, estimating parameter
values, and predicting.

The seven-step data assimilation procedure is essential to
estimate parameter values and constrain model
predictions through parameterization. The
parameterization, in turn, is essential for realistic
model prediction.
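
Step 5 mentions testing the convergence of parallel sampling series with the Gelman-Rubin method. A minimal sketch of that diagnostic follows, using the standard potential scale reduction factor for chains of equal length. The chains here are synthetic random draws standing in for MCMC output, and the function name `gelman_rubin` is an assumption rather than code from the paper.

```python
import random


def gelman_rubin(chains):
    """Potential scale reduction factor for a list of equal-length chains."""
    m = len(chains)     # number of parallel chains
    n = len(chains[0])  # samples per chain
    means = [sum(ch) / n for ch in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    w = sum(
        sum((x - mu) ** 2 for x in ch) / (n - 1)
        for ch, mu in zip(chains, means)
    ) / m
    var_plus = (n - 1) / n * w + b / n  # pooled estimate of posterior variance
    return (var_plus / w) ** 0.5


rng = random.Random(42)
# Five chains sampling the same distribution: should look converged.
mixed = [[rng.gauss(0.1, 0.002) for _ in range(1000)] for _ in range(5)]
# Five chains stuck around different values: should look unconverged.
stuck = [[rng.gauss(0.1 + 0.01 * k, 0.002) for _ in range(1000)] for k in range(5)]

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
print("R-hat, well-mixed chains:", r_mixed)
print("R-hat, stuck chains:", r_stuck)
```

Values close to 1 suggest that the parallel runs are sampling the same distribution; values well above 1 suggest that the chains have not yet converged.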


SCIENTIFIC VALUES OF DATA ASSIMILATION 


Data assimilation is a statistical tool. Its scientific 
values are realized when it is used to estimate 
parameter values, select alternative model struc- 
tures, quantify uncertainty, and/or evaluate values 
of different data sets to constrain model prediction. 
The study by Xu et al. (2006) has illustrated how 
data assimilation was used to estimate parameters 
and quantify uncertainty of estimated parameters 
and model predictions. Chapter 31 further explains 
how data assimilation is used to quantify uncer- 
tainty associated with different model structures. 
Briefly, the chapter shows a study conducted by
Shi et al. (2018) evaluating three soil carbon models,
a classic (conventional) model, a microbial
model (MIMICS), and a vertically resolved model
(CLM4.5), with three data sets: topsoil organic
carbon in 0-30 cm, subsoil organic carbon in
30-100 cm, and microbial biomass carbon. The
three data sets can constrain subsets of parameters
for all three models, while the more complex models
generate larger uncertainties in predictions even
with the same data sets to constrain parameters.
Chapters 29 and 32 (i.e., practice in unit 8) 
explore how data assimilation can be used to eval- 
uate values of different data sets to constrain model 
predictions. The values of data sets are measured 
by information content according to the Shannon 
information index and quantified from probability 
density functions of predicted changes. In prin- 
ciple, data sets contain information mainly on the 
parameters that are related to relevant processes. 
For example, flux data, such as net ecosystem 
exchange and gross primary production, can usu- 
ally constrain parameters related to flux processes, 
such as leaf area index and maximal carboxylation 
rate. In comparison, pool data, such as plant bio- 
mass and soil carbon content, usually have infor- 
mation to constrain pool-related parameters, such 
as carbon transfer coefficients between pools. 
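
One way to make the idea of information content concrete is to compare the Shannon entropy of predictions before and after a data set is assimilated. The sketch below is only an illustration under stated assumptions: predictions are binned into a fixed histogram, entropy is computed in bits, and both samples are synthetic. It is not the procedure used in Chapters 29 and 32, and the function name is hypothetical.

```python
import math
import random


def shannon_entropy(samples, lo, hi, n_bins=20):
    """Shannon entropy (bits) of a histogram of samples on [lo, hi)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for s in samples:
        if lo <= s < hi:
            index = min(int((s - lo) / width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


rng = random.Random(7)
# Hypothetical predicted change in a carbon pool without data constraints.
unconstrained = [rng.uniform(0.0, 100.0) for _ in range(5000)]
# Hypothetical predictions after assimilating an informative data set.
constrained = [rng.gauss(50.0, 5.0) for _ in range(5000)]

h_before = shannon_entropy(unconstrained, 0.0, 100.0)
h_after = shannon_entropy(constrained, 0.0, 100.0)
information_gain = h_before - h_after
print("entropy before:", h_before, "after:", h_after, "gain:", information_gain)
```

A larger drop in entropy means the assimilated data narrowed the distribution of predictions more, which is one simple reading of a data set being more valuable.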
Data assimilation is often used to test alternative
model structures. For example, four alternative
models were evaluated for their representation of
the priming effect on soil organic carbon decomposition
with 84 data sets from 26 studies (Liang
et al. 2018). Scientifically, the study evaluates
whether the soil priming effect leads to a net loss
or a net gain of soil organic carbon after new
carbon input is added. Priming is a term to describe
the stimulation of old soil organic carbon (SOC)
decomposition by new carbon addition. This priming effect
usually results from microbial growth stimulated
by new carbon addition.

The study requires data collected from
isotope-labeled carbon addition experiments. Those
experiments have to provide SOC content, the added
amount of new carbon, and multiple measurements
of CO2 emission rates from total SOC samples and
from the labeled new carbon substrate, from which the
CO2 emission rates from old SOC can be estimated.

The study uses four models: a conventional
soil carbon dynamic model, a new-old carbon
interactive model, a Michaelis-Menten model, and a
reverse Michaelis-Menten model. Data assimilation
is used to integrate the 84 data sets with the
four models to evaluate data-model matches by
comparing observed with modeled cumulative
CO2 emissions from the total SOC, old SOC,
and new carbon substrate (Figure 21.4). The
conventional model can fit observed carbon releases
from new, old, and total SOC extremely well. The
interactive model can fit observations quite well,
too. However, the fitting is not good for the
Michaelis-Menten and reverse Michaelis-Menten models,
especially with observed carbon releases from old
and total SOC.

The model-data fitting also shows that the conventional
model fits the best and the Michaelis-Menten
model fits the worst. A deviance information criterion
(DIC) was used to select among alternative model
structures. DIC penalizes the model severely by the
number of parameters. Among the four models, DIC
is the smallest for the interactive model but largest
for the conventional model, as the latter has 12
parameters, although its fit to the
data is the tightest. Based on the DIC, the most
parsimonious model is the interactive model.

Figure 21.4. An example showing the performances of different models in simulating cumulative CO2 emissions from old and
new C substrates. Dots and lines are observations and model simulations, respectively. Shaded areas are the simulated ranges
from 2.5th to 97.5th percentiles (i.e., 95% range). Blue and red are CO2 emissions from old C at the control and new C addition
treatments, respectively; black is CO2 emissions from added new C. The distributions of simulated cumulative CO2 emissions at
the end of experiment are also shown in each panel.
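
For readers who want to see how such a criterion can be computed from MCMC output, the sketch below uses the common definition DIC = Dbar + pD, where the deviance D is -2 times the log-likelihood, Dbar is the mean deviance over posterior samples, and pD = Dbar - D(mean parameters) is the effective number of parameters. This form is an assumption, since the chapter does not give the formula. The Gaussian likelihood, the one-parameter toy model y = theta * x, and the posterior samples (drawn from the analytic posterior as a stand-in for MCMC output) are all hypothetical.

```python
import math
import random


def deviance(theta, x, y, sigma):
    """-2 * Gaussian log-likelihood for the toy model y = theta * x."""
    n = len(y)
    sse = sum((yi - theta * xi) ** 2 for xi, yi in zip(x, y))
    return n * math.log(2.0 * math.pi * sigma ** 2) + sse / sigma ** 2


def dic(samples, x, y, sigma):
    """Return (DIC, pD) from posterior samples of a single parameter."""
    d_bar = sum(deviance(t, x, y, sigma) for t in samples) / len(samples)
    theta_bar = sum(samples) / len(samples)
    p_d = d_bar - deviance(theta_bar, x, y, sigma)  # effective number of parameters
    return d_bar + p_d, p_d


rng = random.Random(3)
x = [float(i) for i in range(1, 21)]
sigma = 1.0
y = [2.0 * xi + rng.gauss(0.0, sigma) for xi in x]

# Stand-in for MCMC output: draws from the analytic posterior under a flat prior.
sxx = sum(xi * xi for xi in x)
theta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
samples = [rng.gauss(theta_hat, sigma / math.sqrt(sxx)) for _ in range(4000)]

dic_value, p_d = dic(samples, x, y, sigma)
print("DIC:", dic_value, "effective number of parameters:", p_d)
```

In this one-parameter example pD comes out close to 1. A model with more free parameters that are constrained by the data would carry a larger penalty, which is how a tightly fitting but heavily parameterized model can still end up with the larger DIC.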

Then, the interactive model was used to estimate
the priming effect, replenishment, and net
change in carbon after adding new carbon in the
soil incubation experiments. Nearly 54% of the
newly added carbon stays in the soil, while old
carbon decomposition stimulated via the priming effect
amounts to nearly 10% of the newly added carbon. As a
consequence, SOC has a net gain of 32% of the added
new carbon. This analysis indicates that priming
does happen but is unlikely to lead to a net loss
of SOC.

This study used data assimilation to address a
highly controversial issue about the priming effect. The
study shows that adding new carbon to soil does 
result in a priming effect but eventually results in 
net carbon gain. The increase in SOC is related to 
nitrogen content in the added substrates. The study 
shows how data assimilation was used to select 
alternative model structures and suggests that a 
two-pool interactive model is the most parsimoni- 
ous model to represent a SOC priming effect. 

Overall, data assimilation is a statistically rigor- 
ous approach to model parameterization, which 
is essential to generate realistic model prediction. 
Data assimilation is usually conducted by follow- 
ing a seven-step procedure. The seven steps are: (1) 
setting up an objective; (2) collecting data sets; 
(3) having a model; (4) defining a cost function; 
(5) using a global optimization method to mini- 
mize the cost function; (6) estimating parameter 
values; and (7) predicting. The scientific values of 
data assimilation can be realized when it is used 
to estimate parameters, select alternative model 
structures, evaluate values of data sets in constrain- 
ing model predictions, and quantify uncertainty in 
model prediction. 


SUGGESTED READING 


Xu, T., L. White, D. Hui, and Y. Luo. 2006. Probabilistic
inversion of a terrestrial ecosystem model: Analysis
of uncertainty in parameter estimation and model
prediction. Global Biogeochemical Cycles, 20, GB2007,
doi:10.1029/2005GB002468.


QUIZZES 


1. What are the three elements, which all have 
to be perfectly lined up to realistically forecast 
future states of an ecosystem? Why? 


2. Data assimilation is a method 
a. to integrate data with model 
b. to calibrate a model with data 
c. to use statistical principles of analyzing data 
d. often used for parameter estimation 


3. Why is simulation modeling good for exploring 
ideas but not for prediction or forecasting? 


4. Why are field research and modeling scientifi- 
cally complementary? It is because: 


a. one makes measurement and the other uses 
computer 


b. they are two approaches to scientific inquiry 
of a research subject in different ways 


c. the field research records the state of an eco- 
system whereas the modeling explores rela- 
tionships among processes 


d. data from field research can be better inter- 
preted from process understanding whereas 
model forecast can be better constrained 
with data 


5. What are the seven steps of conducting a data 
assimilation study? 


6. A cost function is 
a. a function to calculate cost of training 


b. to measure mismatches between modeled 
and observed values for all the data sets 


c. a data set to be used in data assimilation 


d. a model to be optimized through data 
assimilation 


7. Posterior probability density function is 


a. to indicate carbon density after field measurement


b. to indicate probability of a parameter value 
before data assimilation 


c. a function of time from prior to posterior 


d. to indicate a relative likelihood of an esti- 
mated parameter value after data assimilation 


8. Why do we need a global instead of local opti- 
mization method for data assimilation? 




CHAPTER TWENTY-TWO 


Bayesian Statistics and Markov Chain Monte Carlo Method 
in Data Assimilation 


Feng Tao 


Tsinghua University, Beijing, China 


CONTENTS 


Introduction
Bayes' Theorem
Markov Chain Monte Carlo Method
Convergence of MCMC Results
Suggested Reading
Quizzes


Data assimilation is an effective way to integrate 
observations into models. We will demonstrate 
how parameters in a model may be estimated by 
data assimilation in such a way that model simula- 
tions best fit observations. Data assimilation based 
on Bayesian inversion is used to retrieve posterior 
distributions of model parameters from observa- 
tions. The Markov Chain Monte Carlo (MCMC) 
method is applied as a numerical method to home 
in on the parameter set that maximizes goodness 
of fit between model outputs and measurements. 


INTRODUCTION 


In this chapter, we first give a brief introduction to Bayesian statistics,
which is the theoretical foundation of data assimilation. By learning
Bayes’ theorem, you will understand how knowledge can be drawn from data.
After familiarizing ourselves with the theory, we then learn how to turn it
into an algorithm, namely the Markov Chain Monte Carlo (MCMC) method, which
is useful for optimization problems such as fitting a model with multiple
parameters to observational datasets. At the end of the chapter, we discuss
how to achieve stable optimization results with the MCMC algorithm, the
so-called simulation convergence problem. This chapter is a brief
introduction to these topics to help readers build an intuitive
understanding of data assimilation. For rigorous proofs of the theorems and
equations mentioned in this chapter, you are encouraged to refer to the
suggested reading at the end.


BAYES’ THEOREM 


Bayes’ theorem, in plain words, is a method for calculating the validity of
thinking (Horgan 2016). The “thinking” can be one’s hypotheses, claims, or
propositions. The result given by Bayes’ theorem (i.e., the updated validity
of thinking) is based on the best available evidence, such as data,
observations, or any other information one can obtain. Bayesian thinking
happens every day in our lives. Before we collect any related information
about a particular question, we all hold some initial belief about its
answer, whether our attitude is initially skeptical or credulous. When we
get new data or information from the real world, we naturally update that
belief. This evidence-based updating is the defining characteristic of
Bayes’ theorem.
We can mathematically express Bayes’ theorem as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \propto P(B \mid A)\,P(A) \qquad (22.1)$$

If we label events happening in our lives alphabetically as A and B, the
probability of event A happening can be expressed as P(A), and the
probability of event B happening as P(B). We use P(A|B) to indicate the
probability of event A happening given that event B has happened.
Analogously, P(B|A) represents the probability of event B happening given
that event A has happened.

Bayes’ theorem expresses that the probability of event A, given that event B
has already happened, is proportional to the product of two terms: the
probability of event B given that event A has happened, and the probability
of event A itself. When we calculate the probability of event A happening
given that event B has already happened (i.e., P(A|B)), we can assume event B
has been well observed. The probability of event B happening (i.e., P(B))
then becomes a constant, and P(A|B) is proportional to the product of P(B|A)
and P(A). We may know some information about A and can translate such
information into an initial thinking (i.e., P(A)). Based on the initial
thinking, we can also get a sense of how likely it is for event B to happen
given that A has happened (i.e., P(B|A)). Mathematically we call P(A) our
prior knowledge, while P(B|A) is termed the likelihood. Combining the prior
knowledge and likelihood, we update our thinking as posterior knowledge
(i.e., P(A|B)).

We will use an example to illustrate how Bayes’ theorem works. Suppose you
are at an international meeting and receive a flyer advertising the training
course “New Advances in Land Carbon Cycle Modeling”, which will train people
in carbon cycle modeling and data assimilation. You are interested and
prepared to apply. Before applying, however, you are concerned that your
lack of experience in simulation modeling may hinder your success in the
training course. You may wonder what the probability of your success in the
training course will be, given that you have no previous modeling
experience. By emailing the course coordinator, you learn that according to
previous records, 80% of past trainees attending the training course
succeeded in understanding the training material and finished the practices.
In terms of the background of all trainees, half of them had no experience
in modeling before the course. Among the successful trainees, 43.75% had no
experience in modeling before the course.

We can conceptualize this issue into Bayes’ the- 
orem. We can define the event A as one's success in 
the training course and the event B as the trainee 
having no modeling experience before the train- 
ing course. We now know P(B) = 0.5 (half of the 
trainees have no modeling experience before the 
training course); P(A) = 0.8 (80% of the trainees 
succeeded in the training course); and P(B|A) = 
0.4375 (among the successful trainees, 43.75% 
had no experience in modeling before the train- 
ing). By substituting all the factors into Equation 
22.1, we get P(A|B) = 0.7. The result implies that 
even if a trainee did not have any experience in 
modeling before the training course, the chance of 
success is 70%, which is still pretty large. 

The question could go the other way. Suppose you have prior experience in
modeling; you may ask what the probability of your success in the training
course is. To answer this question, we need to calculate P(A|¬B), where ¬B
indicates the event of B not happening, and P(¬B) = 1 − P(B). According to
Equation 22.1,

$$P(A \mid \neg B) = \frac{P(\neg B \mid A)\,P(A)}{P(\neg B)} = \frac{[1 - P(B \mid A)]\,P(A)}{1 - P(B)} \qquad (22.2)$$

When we substitute the probabilities obtained from the coordinator into
Equation 22.2, we get P(A|¬B) = (1 − 0.4375) × 0.8 / (1 − 0.5) = 0.9. So, if
you have previous experience in modeling, the chance of success in the
training course increases from 70% to 90%.
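The two calculations above can be checked with a few lines of Python. This is a minimal sketch: the function name `bayes_posterior` and the variable names are illustrative choices, and the numbers are the ones quoted from the course coordinator in the example.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B), as in Equation 22.1."""
    return likelihood * prior / evidence


p_success = 0.8                    # P(A): trainee succeeds in the course
p_no_exp = 0.5                     # P(B): trainee has no modeling experience
p_no_exp_given_success = 0.4375    # P(B|A): no experience among successful trainees

# Probability of success without prior modeling experience (Equation 22.1).
p_success_given_no_exp = bayes_posterior(
    p_no_exp_given_success, p_success, p_no_exp
)

# Probability of success with prior modeling experience (Equation 22.2),
# using P(not B|A) = 1 - P(B|A) and P(not B) = 1 - P(B).
p_success_given_exp = bayes_posterior(
    1.0 - p_no_exp_given_success, p_success, 1.0 - p_no_exp
)

print(round(p_success_given_no_exp, 4), round(p_success_given_exp, 4))
```

Rounded to four decimals, the two values reproduce the 70% and 90% chances of success discussed in the text.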
We used discrete events and their probabilities in the above example to
illustrate Bayes’ theorem. In simulation modeling, we use parameters that
take continuous values to represent ecological processes. We assume that if
we choose the correct parameter values, observations in the real world
(e.g., soil organic carbon content) can then be predicted by the model. In
this context, we use probability distributions of parameters (as expressed
by the probability density function) to show all possible values of a
parameter and the relative likelihood of different values within that
distribution. Bayes’ theorem will help us, in a reverse way, find the most
likely distribution of parameter $\theta$ to best describe the observed data
$x$ in a model:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)} \propto p(x \mid \theta)\,p(\theta) \qquad (22.3)$$

where $p(\theta)$ is the prior probability density function of parameter
$\theta$, and $p(x \mid \theta)$ represents the likelihood that the model,
with $\theta$ set to different values from its distribution, correctly
predicts the observation $x$. Combining the prior distribution $p(\theta)$
with the likelihood $p(x \mid \theta)$ gives the posterior probability
density function of $\theta$ (i.e., $p(\theta \mid x)$).

Figure 22.1. Schematic diagram of the Bayesian inference process.

Bayes’ theorem explicitly offers the possibility of using observational data
to refresh our prior knowledge. The term $p(x \mid \theta)$ in Equation 22.3
shows how likely we are to observe data $x$ under different values of
parameter $\theta$ from its distribution. This indicates that if we propose
different $\theta$ values and record them with their likelihoods (i.e.,
$p(x \mid \theta)$), we will eventually obtain the posterior distribution
describing the optimized parameter values that best fit the observations
(Figure 22.1).
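For a single parameter, the idea of proposing many values of the parameter and recording each with its likelihood can be sketched by brute force on a regular grid. The example below is an illustration under stated assumptions, not a method from this chapter: the "model" simply predicts the observation to be equal to the parameter, the observation errors are assumed Gaussian with a known standard deviation `sigma`, the prior is uniform, and the observation values are hypothetical.

```python
import math


def grid_posterior(observations, sigma, theta_min, theta_max, n_grid=1001):
    """Evaluate p(theta|x) on a regular grid for a uniform prior."""
    step = (theta_max - theta_min) / (n_grid - 1)
    thetas = [theta_min + i * step for i in range(n_grid)]
    prior = 1.0 / (theta_max - theta_min)          # uniform p(theta)
    unnormalized = []
    for theta in thetas:
        # Gaussian likelihood p(x|theta), up to a constant factor.
        log_like = -sum((x - theta) ** 2 for x in observations) / (2.0 * sigma ** 2)
        unnormalized.append(math.exp(log_like) * prior)
    evidence = sum(unnormalized) * step            # numerical stand-in for p(x)
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior


obs = [4.8, 5.3, 5.1, 5.0]                         # hypothetical observations
thetas, post = grid_posterior(obs, sigma=0.5, theta_min=0.0, theta_max=10.0)
theta_best = thetas[post.index(max(post))]         # most probable grid value
print(theta_best)
```

A grid search like this becomes impractical once a model has more than a few parameters, which is one motivation for the sampling approach introduced in the next section.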


MARKOV CHAIN MONTE CARLO METHOD 


In data assimilation, the Markov Chain Monte Carlo 
(MCMC) method offers an algorithm that screens 
a range of possible parameter values under specific 
prior knowledge and retrieves the posterior distri- 
bution of the parameter. The MCMC method is com- 
posed of two parts, namely a Markov Chain and the 
Monte Carlo method. A Markov Chain is a stochastic 
model that describes a sequence of possible events. 
The probability of each event in a Markov Chain 
depends only on the state attained in the previous 
event. The Monte Carlo method, on the other hand,
represents a class of algorithms that sample events
from a Markov Chain to fit an optimization target.
Various algorithms exist for sampling. Here, we will
focus on the Metropolis-Hastings algorithm.

We begin with an example to show the work- 
flow of the MCMC method. Suppose that we are 
organizing the training course “New Advances in 
Land Carbon Cycle Modeling”. The admission pro- 
cess is critical to the final outcome of the course. By 
scanning the profiles of applicants, we need to select 
those applicants who will most likely benefit from 
the training. The question at this stage is then formu- 
lated as “through which algorithm shall we select 
the participants from among all the candidates?” 

First, we may need to specify the topical scope of 
the training course so that we can define the target 
group who may be of interest (Figure 22.2). A clear
and well-defined scope will help define the “per- 
fect candidate” who will definitely benefit from 
the training. For example, we may accept people 
who are doing research in the carbon cycle field, 
familiar with basic simulation modeling, and who 
would like to integrate models and data to enhance 
their understanding of ecological processes. 

After defining the scope of the training course, 
we may send out the flyers to advertise the course. 
We will then receive applications from people who 
are interested in and believe that they can benefit 
from the course. All the applicants now become 
the potential candidates as trainees. We now need 
to evaluate the match between the backgrounds of 
candidates and our training scope. The character- 
istics of the “perfect candidate” in our mind will 
function as a reference in such evaluation. 

We make decisions based on the assessments of candidates. Three
possibilities will be on the table. We may find that the background of the
applicant is very close to that of our “perfect candidate”, which means the
applicant will most likely benefit from the training course. In this case,
we will admit the applicant without hesitation. Conversely, we may also
receive applications where the background of the applicant does not fit the
scope of the training at all. We will directly reject those applications.
Another possibility is that the background of the applicant does not
perfectly match our “perfect candidate”, but is still close enough that the
applicant is likely to benefit from our course. Instead of rejecting such
applicants directly, we may consider an algorithm to decide whether to
accept or reject them. For example, we may set new criteria to evaluate the
fitness of these applicants for the training course and eventually admit
those who meet the criteria.

Figure 22.2. An example of using the MCMC method to admit candidates based on their application details.

Figure 22.3. Workflow of the MCMC method. The Metropolis-Hastings algorithm is used to accept or reject proposed parameter values in a Markov Chain.

Four steps together make up the admission process. The first step initiates
the process: we need to set the scope of the training course and determine
the target group who will be of interest. In the second step, applications
provide all the possibilities we can choose from to fit our target of
maximizing the overall benefit of the training. We then make decisions in
the third step to accept or reject the applications according to some
standards. Finally, in step four, we repeat steps two and three for each
candidate.

The MCMC method follows the same four steps to optimize parameters to fit
model simulations with observations (Figure 22.3). In step one, we initiate
the algorithm by first determining the optimization target (i.e., the
observations). We set a prior range of the possible parameter values that
remains consistent with the ecological meaning of the investigated process
and, simultaneously, allows sufficient flexibility in model simulations.

The second step generates the Markov Chain. We 
propose a sequence of candidates of parameter val- 
ues so that we can select the candidates by a certain 
algorithm (in our example, the Metropolis-Hastings 
algorithm). A proposal distribution will be selected 
and serves to generate new candidate parameter val- 
ues. We then apply the proposed parameter values 
in the model and get the simulation results. 

We use two proposal distributions in step two, 
which are the uniform distribution in the test 
run and the normal distribution in the formal 
run (Haario et al. 2001). When we have no prior 
knowledge about the parameter, a uniform distri- 
bution will be a reasonable start. The uniform dis- 
tribution does not contain any shape information 
of the parameter distribution, but only sets the 
upper and lower limits of parameter values, which 
we determined based on our general understanding of the parameter in step
one. The $k$-th proposed parameter value is expressed as:

$$\theta^{k} = \theta^{k-1} + r\,\frac{\theta^{\max} - \theta^{\min}}{D} \qquad (22.4)$$

where $r$ is a random value drawn from the uniform distribution in the
interval between 0 and 1; $D$ indicates the maximum step size in the
proposal. For example, $D = 5$ indicates that the maximum step of the $k$-th
proposed parameter value ($\theta^{k}$) from the $(k-1)$-th proposed
parameter value ($\theta^{k-1}$) is 20% of the prior range (i.e.,
$\theta^{\max} - \theta^{\min}$).
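A proposal of the form in Equation 22.4 might be coded as follows. This is a sketch under two assumptions that are not stated in the text: the random value is drawn symmetrically from the interval between −1 and 1, so that the chain can move in either direction while the maximum step remains 1/D of the prior range, and a proposal that falls outside the prior range is simply redrawn. The function name `propose_uniform` is hypothetical.

```python
import random


def propose_uniform(theta_prev, theta_min, theta_max, D=5.0, rng=random):
    """Propose a new value at most (theta_max - theta_min) / D from theta_prev."""
    while True:
        r = rng.uniform(-1.0, 1.0)       # assumption: symmetric random step
        theta_new = theta_prev + r * (theta_max - theta_min) / D
        if theta_min <= theta_new <= theta_max:
            return theta_new             # keep only values inside the prior range


rng = random.Random(1)                   # seeded generator for reproducibility
theta = 0.5                              # hypothetical starting value
proposals = []
for _ in range(5):
    theta = propose_uniform(theta, theta_min=0.0, theta_max=1.0, rng=rng)
    proposals.append(theta)
print(proposals)
```

With D = 5 and a prior range from 0 to 1, no proposal moves more than 0.2 away from the previous value.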

After the test run of MCMC with a uniform proposal distribution, we obtain
some preliminary information on the investigated parameter. The following
formal run then uses the results of the test run as its prior knowledge so
that we can more efficiently obtain a stable posterior distribution of the
parameter. A normal distribution is assumed in the formal run. When we
investigate multiple parameters, we assume that the parameters follow a
multivariate normal distribution. The parameter values are proposed as:

$$\theta^{k} = \begin{cases} \theta^{k-1} + N\left(0, \operatorname{cov}(\theta^{\mathrm{test}})\right), & k \le k_0 \\ \theta^{k-1} + N\left(0, \operatorname{cov}(\theta^{\mathrm{formal}})\right), & k > k_0 \end{cases} \qquad (22.5)$$

where we calculate the covariance of parameters from the results of the test
run at the starting stage of the formal run (i.e., before the $k_0$-th
iteration). After some iterations of the formal run, we continuously update
the covariance information from the formal run results. The point $k_0$ is
an empirical value we may set based on experience, depending on how fast we
think the formal run can fully utilize the prior information in the test
run.
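For a single parameter, the covariance matrix in the normal proposal reduces to a variance, and the switch between test-run and formal-run information can be sketched as below. The function name `propose_normal`, its arguments, and the sample values are hypothetical, and the use of the sample standard deviation is an assumption made for this illustration.

```python
import random
import statistics


def propose_normal(theta_prev, test_samples, formal_samples, k, k0, rng=random):
    """Propose theta_prev + N(0, variance) for one parameter.

    The variance comes from the test run up to iteration k0 and from the
    accepted values of the formal run afterwards.
    """
    source = test_samples if k <= k0 else formal_samples
    sd = statistics.stdev(source)        # sample standard deviation
    return theta_prev + rng.gauss(0.0, sd)


rng = random.Random(0)
test_run = [0.28, 0.35, 0.31, 0.26, 0.33]      # hypothetical test-run values
formal_run = [0.30, 0.31, 0.29, 0.30, 0.32]    # hypothetical formal-run values
early = propose_normal(0.30, test_run, formal_run, k=10, k0=100, rng=rng)
late = propose_normal(0.30, test_run, formal_run, k=500, k0=100, rng=rng)
print(early, late)
```

With several parameters, the same idea would draw the step from a multivariate normal distribution with the estimated covariance matrix, which needs a linear algebra library and is omitted here.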

In step three, we apply the Metropolis-Hastings algorithm to accept or
reject proposed parameter values. We evaluate how well the simulated results
based on the proposed parameter values fit the observations and decide
whether to accept the proposed parameter values or not. A cost function
serves to quantify the discrepancy between the values of the target variable
$x$ predicted in the simulations and the observations of $x$. We express the
cost function of the $k$-th proposed parameter ($\theta^{k}$) as:

$$A_k = \frac{\sum_{i=1}^{n} \left(x_{i,\mathrm{mod}} - x_{i,\mathrm{obs}}\right)^2}{2\sigma^2} \qquad (22.6)$$

where $\sum_{i=1}^{n} (x_{i,\mathrm{mod}} - x_{i,\mathrm{obs}})^2$ describes
the distance of the simulation results from observations with a sample size
of $n$ (e.g., a ten-year record of net primary productivity at a grassland
site); $\sigma^2$ is the variance of observations, which we can either
calculate from measurements or set from empirical knowledge. Equation 22.6
expresses the cost function for a single data set. When multiple data sets
are used (e.g., ten-year records of net primary productivity, soil organic
carbon content, and soil respiration at a grassland site), the mismatches of
individual data sets are added together, optionally with weightings to
emphasize the importance of certain variables relative to the others, to
obtain the cost function.
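A cost function of this kind can be sketched in a few lines. The sketch assumes that the mismatch is the sum of squared differences scaled by twice the observation variance, and that multiple data sets are combined as a weighted sum; the function names and all data values are hypothetical.

```python
def cost_function(simulated, observed, sigma2):
    """Squared model-data mismatch for one data set, scaled by 2 * variance."""
    squared_distance = sum((m - o) ** 2 for m, o in zip(simulated, observed))
    return squared_distance / (2.0 * sigma2)


def total_cost(datasets):
    """Weighted sum of costs; datasets holds (simulated, observed, sigma2, weight)."""
    return sum(w * cost_function(sim, obs, s2) for sim, obs, s2, w in datasets)


# Hypothetical net primary productivity and soil respiration records.
npp = ([410.0, 395.0, 430.0], [400.0, 405.0, 420.0], 25.0 ** 2, 1.0)
resp = ([2.1, 2.4, 1.9], [2.0, 2.5, 2.2], 0.3 ** 2, 1.0)
print(cost_function(npp[0], npp[1], npp[2]), total_cost([npp, resp]))
```

Scaling each data set by its own variance keeps variables with very different units and magnitudes comparable before they are added together.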
Now we make use of Bayes’ theorem. From Equation 22.3, we can define the
likelihood of the parameter value ($\theta^{k}$) in fitting model
simulations ($x_{i,\mathrm{mod}}$) to observations ($x_{i,\mathrm{obs}}$) as:

$$L(\theta^{k}) = p(\theta^{k} \mid x) \propto p(x \mid \theta^{k})\,p(\theta^{k}) \qquad (22.7)$$

We use the cost function in a monotonically decreasing exponential form to
describe $p(x \mid \theta^{k})$:

$$L(\theta^{k}) \propto \exp(-A_k) \qquad (22.8)$$




Using Equation 22.8 to quantify the likelihood requires several assumptions
in Bayesian statistics (Craiu and Rosenthal 2014). From the data
optimization perspective, $A_k$ expressed in Equation 22.6 offers a measure
of the deviation of model simulations from observations. We connect such a
metric with the likelihood of the proposed parameter values: we assign a
high likelihood to proposed parameter values when the cost function result
is small and vice versa.

We use the likelihood ratio of consecutively proposed parameter values in
the Markov Chain to make the acceptance decision. The probability of
accepting the $k$-th proposed parameter value ($\theta^{k}$) based on the
results of the $(k-1)$-th proposed parameter value ($\theta^{k-1}$) can be
formulated as:

$$P(\theta^{k-1}, \theta^{k}) = \min\left\{1, \frac{L(\theta^{k})}{L(\theta^{k-1})}\right\} \qquad (22.9)$$

Substituting Equation 22.8 into Equation 22.9 gives:

$$P(\theta^{k-1}, \theta^{k}) = \min\left\{1, \exp(A_{k-1} - A_k)\right\} \qquad (22.10)$$


When the cost function with $\theta^{k}$ is smaller than that with
$\theta^{k-1}$, the results of the present simulation are closer to the
observations than those of the previous one. By Equation 22.10, we get
$P(\theta^{k-1}, \theta^{k}) = 1$, which indicates we will definitely accept
the proposed parameter value $\theta^{k}$. Conversely, if the cost function
calculated from the proposed parameters $\theta^{k}$ is higher than that of
$\theta^{k-1}$, the simulation with $\theta^{k}$ fits the observations worse
than that with $\theta^{k-1}$. We then get a $P(\theta^{k-1}, \theta^{k})$
in the interval between 0 and 1. In this case, we may or may not accept the
proposed $\theta^{k}$, depending on random chance. In practice, we compare
$P(\theta^{k-1}, \theta^{k})$ with a random value $u$ that follows the
uniform distribution $u \sim U(0, 1)$. We accept the proposed $\theta^{k}$
when $P(\theta^{k-1}, \theta^{k}) > u$. Otherwise, we reject the proposed
$\theta^{k}$.
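The acceptance rule can be written compactly. In this sketch the function names are illustrative, and the early return for an improved cost is a design choice added here to avoid evaluating a very large exponential; it gives the same result as taking the minimum in Equation 22.10.

```python
import math
import random


def acceptance_probability(cost_prev, cost_new):
    """Return min(1, exp(cost_prev - cost_new)), as in Equation 22.10."""
    if cost_new <= cost_prev:
        return 1.0                       # better fit: always accepted
    return math.exp(cost_prev - cost_new)


def accept(cost_prev, cost_new, rng=random):
    """Accept when the acceptance probability exceeds a draw u from U(0, 1)."""
    u = rng.random()
    return acceptance_probability(cost_prev, cost_new) > u


rng = random.Random(42)
print(acceptance_probability(5.0, 3.0))  # improvement in the cost
print(acceptance_probability(3.0, 4.0))  # worse fit by one cost unit
print(accept(3.0, 4.0, rng=rng))         # outcome depends on the random draw
```

A proposal that raises the cost by one unit is accepted with probability exp(−1), roughly 37% of the time, and the chance falls off quickly as the fit gets worse.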

It may seem strange that we sometimes accept the proposed $\theta^{k}$,
instead of rejecting it directly, when a proposed set of parameters yields a
simulation worse than the previous set (i.e., the new cost is larger than
the previous one, $A_{k-1} - A_k < 0$). In optimization, such “bad” results
can be helpful in finding the global minimum of the cost function. The
response surface of the cost function can be extremely nonlinear for a
multivariate, high-dimensional model. If we never give a “bad” set of
parameters a chance to be accepted, we may easily get trapped in a local
optimum instead of the global optimum. Therefore, we assign a probability of
acceptance, subject to chance, when $P(\theta^{k-1}, \theta^{k})$ is in the
interval between 0 and 1.
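Putting the four steps together, a complete but deliberately small Metropolis sampler for one parameter might look like the sketch below. Everything specific in it is an assumption made for illustration: the model is a single first-order decay curve, the "observations" are synthetic and noise-free, the proposal step is drawn symmetrically, proposals outside the prior range are rejected, and the current value is recorded at every iteration (the usual Metropolis convention) rather than only on acceptance.

```python
import math
import random


def model(k, times):
    """First-order decay: fraction of carbon remaining at each time."""
    return [math.exp(-k * t) for t in times]


def cost(simulated, observed, sigma2):
    """Squared mismatch scaled by twice the observation variance."""
    return sum((m - o) ** 2 for m, o in zip(simulated, observed)) / (2.0 * sigma2)


def metropolis(observed, times, k_min, k_max, n_iter, sigma2, D=5.0, seed=0):
    rng = random.Random(seed)
    k_current = rng.uniform(k_min, k_max)          # step 1: start inside the prior range
    cost_current = cost(model(k_current, times), observed, sigma2)
    chain = []
    for _ in range(n_iter):                        # step 4: repeat steps 2 and 3
        # Step 2: uniform proposal centered on the current value.
        k_new = k_current + rng.uniform(-1.0, 1.0) * (k_max - k_min) / D
        if k_min <= k_new <= k_max:
            cost_new = cost(model(k_new, times), observed, sigma2)
            # Step 3: accept with probability min(1, exp(cost_current - cost_new)).
            if cost_new <= cost_current or rng.random() < math.exp(cost_current - cost_new):
                k_current, cost_current = k_new, cost_new
        chain.append(k_current)                    # record the current state
    return chain


times = list(range(10))
observed = model(0.3, times)                       # synthetic data with true k = 0.3
chain = metropolis(observed, times, k_min=0.0, k_max=1.0, n_iter=5000, sigma2=0.05 ** 2)
kept = chain[len(chain) // 2:]                     # discard the first half of the chain
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

Because the synthetic data were generated with a decay rate of 0.3, the retained samples should cluster around that value, and a histogram of `kept` approximates the posterior distribution of the parameter.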


CONVERGENCE OF MCMC RESULTS 


When we do MCMC, we need to confirm that the 
results are the same even if we start from differ- 
ent points along the Markov Chain. This is called 
testing for convergence (Gelman et al. 2014). In 
theory, the MCMC should converge as long as the 
Markov Chains are long enough. That means that 
the parameters from different independent MCMC 
simulations will share the same or very similar 
posterior distributions. 

The estimated parameters in the MCMC sim- 
ulation, however, do not necessarily resemble 
each other at the very beginning among different 
Markov Chains. In the test run and starting stage 
of the formal run, accepted parameter values may 
experience a different pathway to find the global 
optimum. However, after sufficient iterations of 
MCMC simulation (e.g., 10,000 iterations), the
observations will eventually constrain the param- 
eter to the same space, where the model simulates 
most closely to the measurements entering the 
cost function. By plotting the sampling series, we 
can visually determine when the accepted param- 
eter values are stably constrained and set the early 
stage of the sampling series as the burn-in period. 
The exact length of burn-in period can be empiri- 
cal. The first half of the accepted parameter chain 
is a safe setting for burn-in period given sufficient 
iterations in MCMC simulation (e.g., 100,000 
iterations). In the final analysis, we exclude the 
burn-in results and only use the results after con- 
vergence to generate the posterior distribution. 

Both visual and statistical assessment can be used to determine the
convergence of MCMC results. The degree of overlap among different MCMC
sampling series is a rough indicator of the convergence. We can also
quantify the convergence statistically. The Gelman-Rubin (G-R) statistics
are one option. The G-R statistics quantify the differences among different
runs ($B_i$) and the fluctuation of accepted parameter values within the
same run ($W_i$):

$$B_i = \frac{N}{K-1} \sum_{k=1}^{K} \left(\bar{c}_i^{\,\cdot k} - \bar{c}_i^{\,\cdot\cdot}\right)^2, \qquad W_i = \frac{1}{K(N-1)} \sum_{k=1}^{K} \sum_{n=1}^{N} \left(c_i^{n,k} - \bar{c}_i^{\,\cdot k}\right)^2$$

where $i$ denotes parameters investigated in the study; $K$ is the number of
parallel runs; $N$ is the length of each run; $c_i^{n,k}$ represents the
$n$-th accepted value of parameter $i$ in the $k$-th parallel run after the
burn-in period; $\bar{c}_i^{\,\cdot k}$ is the mean of parameter $i$ within
run $k$; and $\bar{c}_i^{\,\cdot\cdot}$ is the mean over all runs. The G-R
statistic is then defined as:

$$GR_i = \sqrt{\frac{W_i\,(N-1)/N + B_i/N}{W_i}}$$

Once convergence is reached, $GR_i$ should approach 1.
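The G-R statistic for one parameter can be computed directly from several parallel chains. The sketch below assumes the standard between-run and within-run variance definitions, chains of equal length with the burn-in period already removed, and hypothetical chain values.

```python
import math


def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter from K chains of length N."""
    K = len(chains)
    N = len(chains[0])
    run_means = [sum(chain) / N for chain in chains]
    grand_mean = sum(run_means) / K
    # Between-run variance B and within-run variance W.
    B = N / (K - 1) * sum((m - grand_mean) ** 2 for m in run_means)
    W = sum(
        sum((value - m) ** 2 for value in chain)
        for chain, m in zip(chains, run_means)
    ) / (K * (N - 1))
    return math.sqrt((W * (N - 1) / N + B / N) / W)


# Hypothetical post-burn-in samples from three parallel runs.
agreeing = [[0.29, 0.31, 0.30, 0.32], [0.30, 0.32, 0.29, 0.31], [0.31, 0.29, 0.30, 0.30]]
disagreeing = [[0.10, 0.11, 0.10, 0.11], [0.50, 0.51, 0.50, 0.51], [0.90, 0.91, 0.90, 0.91]]
print(gelman_rubin(agreeing), gelman_rubin(disagreeing))
```

Runs that sample the same region give a value near 1, whereas runs stuck in different regions give a value far above 1, signalling that the chains have not converged.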

In summary, this chapter introduces two fundamental underpinnings of data
assimilation. The first, Bayesian inference, sets the theoretical foundation
for improving model performance with observations. The second, the Markov
Chain Monte Carlo method, provides a numerical method to assimilate
observational data into a model based on an optimization of the goodness of
fit of the model output to the observational data. In the next chapter,
using examples from different ecological studies, we will show you how data
assimilation can be deployed to advance understanding of the land carbon
cycle.
QUIZZES


1. Briefly describe the Bayes’ theorem. 


2. Why do we need a test run before the formal 
run when we know little about the prior? 


3. Explain why we need to sometimes accept 
parameters that have a larger cost function value 
(A) than the previously accepted ones. 


4. Why do we discard the results from the burn-in 
period? 

5. We need to generate a random value ($u$) to compare with
$\exp(A_{\mathrm{pre}} - A_{\mathrm{new}})$ when the cost function value is
larger than the previous one ($A_{\mathrm{new}} > A_{\mathrm{pre}}$). Why do
we accept the proposed parameter values only when
$u < \exp(A_{\mathrm{pre}} - A_{\mathrm{new}})$, instead of
$u > \exp(A_{\mathrm{pre}} - A_{\mathrm{new}})$?
Soil incubation is a widely used technique for studying soil organic carbon
cycling. Integrating soil incubation data with soil carbon models can
potentially reveal the mechanisms of soil carbon dynamics underlying
observations. This chapter illustrates how data assimilation is applied to
analyze data from soil incubation experiments using soil carbon models.
After a brief introduction to soil incubation experiments and soil carbon
models, a three-pool model is used to illustrate the seven-step procedure of
data assimilation for the analysis of soil incubation data. Two critical
aspects, different cases in the optimization step and the dependence of the
parameter acceptance rate on the cost function, are described with detailed
examples.


SOIL INCUBATION EXPERIMENTS 


Soil incubation experiments are commonly carried out for studying soil
organic carbon and nitrogen cycling processes. Typically, fresh soil samples
are collected from the field, crushed, sieved, and mixed before a certain
amount of soil is placed in a container (e.g., a Mason jar). The container
is then exposed to different treatments of, for example, temperature and
moisture. For carbon decomposition studies, the carbon dioxide (CO₂)
emission rate (or respiration rate) is usually measured repeatedly during
the incubation period. Data from soil incubation studies are usually plotted
as either CO₂ emission rates or cumulative CO₂ emission over time. Compared
with in situ observations in the field, soil incubation experiments allow
ready control of environmental factors and reduce heterogeneity and
confounding effects from many processes and factors. Thus, results from such
experiments can facilitate mechanistic understanding of soil carbon
processes, such as the turnover rates and temperature sensitivity of soil
organic carbon decomposition. However, incubation experiments usually
isolate the soil from plants and other components in ecosystems, and thus
may show different behavior from soils in a full ecosystem setting.

To illustrate how soil incubation data may be used to inform modeling via
data assimilation, this chapter takes as an example the study by Liang et
al. (2015), which compares different methods for estimating temperature
sensitivity of soil organic carbon decomposition. The data are from a study
by Haddix et al. (2011). Fresh soils were sampled from the field and
transported to the laboratory. In the laboratory, the soil samples were
sieved, and visible roots and rocks were picked out. From each soil sample,
subsamples were taken and incubated in jars at different temperatures. The
first seven days were treated as pre-incubation to minimize the possible
artificial influences of field sampling, sieving and transportation. After
the pre-incubation, CO₂ emission rates were measured as the rate of CO₂
concentration increase in the head space of the jars over time. CO₂ emission
rates were measured daily during the first two weeks of incubation, weekly
for the next two weeks, and every four weeks thereafter. Overall, there were
36 sampling occasions over the 588-day incubation period. Cumulative CO₂
emissions can be calculated by adding together the daily amounts of emitted
CO₂ from day 0 (Figure 23.1). These example data will be used for the
following illustration in this chapter.

Figure 23.1. Cumulative CO₂ emission (R) at 25°C and 35°C. Data originally from Haddix et al. (2011) and used in Liang et al. (2015). Modified with permission from Soil Biology and Biochemistry: Liang et al. (2015).
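Because emission rates are measured at irregular intervals, some rule is needed to fill the days between sampling occasions before the daily amounts can be summed. The sketch below assumes linear interpolation between consecutive measurements (the trapezoidal rule), which is an assumption made here rather than the procedure documented for this data set; the function name and the rate values are hypothetical.

```python
def cumulative_emission(days, rates):
    """Cumulative emission at each sampling day from measured emission rates.

    Rates are assumed to vary linearly between consecutive sampling days.
    """
    total = 0.0
    cumulative = [0.0]                   # nothing emitted at the first sampling day
    for i in range(1, len(days)):
        interval = days[i] - days[i - 1]
        total += 0.5 * (rates[i] + rates[i - 1]) * interval
        cumulative.append(total)
    return cumulative


# Hypothetical sampling days and emission rates (amount per day).
days = [0, 1, 2, 7, 14, 42]
rates = [0.025, 0.024, 0.023, 0.018, 0.012, 0.006]
print(cumulative_emission(days, rates))
```

The result has one cumulative value per sampling occasion, which is the form in which the data enter the cost function later in the chapter.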

Conventionally, total CO₂ emissions and/or CO₂ emission rates during the
incubation period are directly compared to reveal the effect of different
treatments, such as temperature or moisture levels. For example, temperature
sensitivity impacts the decomposition of soil organic carbon under climate
warming. In the past, the temperature sensitivity of soil organic carbon
decomposition was directly calculated as the CO₂ emission rate at a higher
temperature, $R_{\mathrm{high}}$, divided by that at a lower temperature,
$R_{\mathrm{low}}$ (Rey and Jarvis 2006). This estimate usually
underestimates the temperature sensitivity after the initial incubation
stage because greater decomposition results in less substrate at high than
at low temperatures at the same point of incubation time. In addition,
direct comparisons of total CO₂ emissions and/or CO₂ emission rates may not
reveal processes underlying soil carbon dynamics, such as turnover rates of
different carbon components and dependences of temperature sensitivity on
substrate quality. When soil carbon models, which explicitly represent such
processes, are integrated with data from soil incubation experiments via
data assimilation, we can potentially learn more about the process responses
underlying the observed soil carbon dynamics.


SOIL CARBON MODELS 


The general framework and underlying principles of many current soil carbon
models were presented in Chapters 1 and 2. Generally, first-order kinetics
are used to simulate soil carbon cycling (Stanford and Smith 1972, Andren
and Paustian 1987). The simplest soil carbon decomposition model simulates
soil carbon dynamics as a single pool (Figure 23.2a). However, one-pool
models usually perform poorly, since soil organic matter is
compartmentalized into pools of different lability, due to various
structures of substrates and protection mechanisms. As a result,
multiple-pool models are most commonly used to simulate soil carbon
dynamics. For example, a two-pool model divides soil carbon into active and
slow pools (Figure 23.2b), while a three-pool model divides soil carbon into
active, slow and passive pools (Figure 23.2c). The two- and three-pool
models simulate soil carbon cycling without transfers among pools. Another
type of model simulates transfers among pools (Figure 23.2d). Similar to
ecosystem models discussed in previous chapters, soil carbon models can be
described in a matrix form (Chapter 5). The only difference when using soil
carbon models to represent incubation experiments is that they have initial
soil carbon pool size(s) at the very beginning of the experiment, but do not
have carbon inputs.

Figure 23.2. Schemes of soil carbon models using first-order kinetics. Modified with permission from Soil Biology and Biochemistry: Liang et al. (2015).

Here we take the three-pool model without transfer (Figure 23.2c) as our
example to illustrate the procedure of data assimilation for analysis of
soil incubation data. The three-pool model can be described as:

$$\frac{dX(t)}{dt} = -K\,X(t) \qquad (23.1)$$

where

$$K = \operatorname{diag}(k_i) = \begin{pmatrix} k_1 & 0 & 0 \\ 0 & k_2 & 0 \\ 0 & 0 & k_3 \end{pmatrix} \qquad (23.2)$$

and

$$X(t) = \begin{pmatrix} X_1(t) \\ X_2(t) \\ X_3(t) \end{pmatrix} \qquad (23.3)$$

$k_1$, $k_2$, $k_3$ are the turnover rates of the active, slow and passive
pools, and $X_1$, $X_2$, $X_3$ are their pool sizes, respectively.
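Because the pools do not exchange carbon and receive no inputs, each pool decays independently and the model has a simple closed-form solution, with each pool size equal to its initial size multiplied by exp(−k t). The sketch below uses that solution and assumes that all carbon lost from the pools is emitted as CO₂. The function names, initial pool sizes, and turnover rates are hypothetical values chosen only for illustration.

```python
import math


def pool_sizes(x0, k, t):
    """Pool sizes at time t: each pool decays as X_i(0) * exp(-k_i * t)."""
    return [x0_i * math.exp(-k_i * t) for x0_i, k_i in zip(x0, k)]


def cumulative_co2(x0, k, t):
    """Carbon lost from all pools by time t, assumed to be emitted as CO2."""
    return sum(x0) - sum(pool_sizes(x0, k, t))


x0 = [0.5, 4.5, 15.0]          # hypothetical initial active, slow, passive pool sizes
k = [0.1, 0.005, 0.00005]      # hypothetical turnover rates (per day)
for day in (1, 28, 588):
    print(day, round(cumulative_co2(x0, k, day), 3))
```

In a data assimilation study, the simulated cumulative emission at each sampling day would be compared with the observed values through the cost function, with the turnover rates and initial pool sizes as the parameters to be estimated.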


APPLICATION OF DATA ASSIMILATION TO SOIL 
INCUBATION DATA 


Our data assimilation example follows the seven 
steps introduced in Chapter 21. Before we start, 
let us revisit the seven-step procedure: (1) 
defining an objective; (2) preparing data; (3) 
choosing a model; (4) using a cost function; (5) 
applying an optimization method; (6) estimating 
parameters; and (7) generating predictions. 


Step 1: defining an objective. The objective of the 
study by Liang et al. (2015) is to compare 
different methods for estimating tem- 
perature sensitivity of soil organic carbon 
decomposition. They first reviewed published 
methods for estimating temperature sensitivity. 
The temperature sensitivity is usually 
expressed as Q10, which measures the proportional 
change in soil carbon decomposition 
rate for a 10 K warming. They found that 
many studies directly compare CO2 emissions 
at different incubation temperatures, 
estimating an apparent temperature sensitivity 
of soil organic carbon decomposition. 
However, this method does not provide the 
intrinsic temperature sensitivity of different 
soil carbon components. As introduced ear- 
lier, soil carbon models consider soil carbon 
dynamics in different pools depending on 
their turnover rates (k), which can be used 
to represent substrates with different lability. 
How the parameter k changes with temper- 
ature can inform the intrinsic temperature 
sensitivity of different soil carbon compo- 
nents. Therefore, a specific objective of the 
study is to estimate the intrinsic temperature 
sensitivities of different carbon pools in soil 
models. 
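
The link between the turnover rate k and Q10 can be made concrete with a short sketch. It assumes the common definition Q10 = (k at T2 / k at T1)^(10 / (T2 − T1)); the function name and the rate values are hypothetical.

```python
def q10_from_rates(k_low, k_high, t_low, t_high):
    """Q10 implied by turnover rates at two temperatures (both in deg C or both in K)."""
    return (k_high / k_low) ** (10.0 / (t_high - t_low))

# Hypothetical turnover rates (1/day) of one pool at 25 and 35 degrees C
k_25, k_35 = 1.0e-3, 1.76e-3
print(q10_from_rates(k_25, k_35, 25.0, 35.0))
```

When the two incubation temperatures are exactly 10 degrees apart, as in this chapter, Q10 reduces to the ratio of the two rates.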


Step 2: preparing data. As mentioned above, the 
data come from an incubation experiment 
by Haddix et al. (2011). For the illustration 
of this chapter, data from the 25°C and 35°C 
treatments (Figure 23.1) are used. Means 
and standard deviations of measurements 
from the incubation experiment are needed 
for data assimilation. The data are organized 
as shown in Table 23.1. 

TABLE 23.1 
Cumulative CO2 emission during the incubation 
organized for data assimilation 

Incubation time          25°C                 35°C 
(days)               Mean     SD          Mean     SD 
1                    0.025    0.002       0.039    0.003 
2                    0.048    0.004       0.076    0.001 
3                    0.072    0.003       0.112    0.003 
4                    0.099    0.007       0.147    0.004 
5                    0.121    0.007       0.182    0.006 
6                    0.142    0.009       0.214    0.008 
7                    0.165    0.007       0.250    0.009 
8                    0.180    0.006       0.267    0.023 
9                    0.212    0.005       0.306    0.019 
10                   0.226    0.004       0.347    0.014 
11                   0.244    0.003       0.389    0.021 
12                   0.271    0.010       0.434    0.010 
13                   0.288    0.011       0.487    0.015 
14                   0.312    0.016       0.528    0.015 
21                   0.459    0.026       0.718    0.014 
28                   0.585    0.055       0.819    0.053 
56                   1.099    0.113       1.290    0.155 
84                   1.478    0.139       1.764    0.219 
112                  1.770    0.161       2.121    0.176 
140                  1.968    0.200       2.477    0.175 
168                  2.149    0.229       2.797    0.151 
196                  2.294    0.265       3.087    0.126 
224                  2.406    0.334       3.326    0.132 
252                  2.508    0.353       3.535    0.129 
280                  2.609    0.386       3.719    0.135 
308                  2.738    0.407       3.900    0.146 
336                  2.865    0.384       4.076    0.168 
364                  2.977    0.395       4.235    0.192 
392                  3.088    0.400       4.373    0.148 
420                  3.207    0.398       4.535    0.150 
448                  3.314    0.399       4.691    0.155 
476                  3.414    0.418       4.829    0.159 
504                  3.525    0.429       4.962    0.159 
532                  3.630    0.435       5.087    0.154 
560                  3.745    0.445       5.210    0.156 
588                  3.853    0.444       5.326    0.172 


Step 3: choosing a model. Models are chosen 
depending on the study objective. Liang 
et al. (2015) use four models to compare 
different estimates of Q10. In practice, the 
length of the incubation experiment is an 
important aspect to consider when choosing 
which model to use. If the experiment is 
relatively short, for example, days to months, 
the one-pool or two-pool models may be 
appropriate. If the incubation lasts longer, 
for example, years, the three-pool model may 
fit the data better. Here, our purpose is to 
illustrate how to use data assimilation to 
analyze soil incubation data, and the example 
experiment lasts for 588 days. Therefore, the 
three-pool model is chosen. The three-pool 
model has a total of eight to-be-determined 
parameters: the initial fractions of the active 
and slow pools, f_1 and f_2; the decomposition 
rates of organic carbon in the three pools at 
25°C, k_1, k_2 and k_3; and the corresponding 
temperature sensitivity parameters, q_1, q_2 
and q_3. Details of these parameters are shown 
in Table 23.2. The initial fraction of the 
passive pool, f_3, does not need to be estimated 
as it can be directly calculated as 
1 − f_1 − f_2. The turnover rates of the carbon 
pools at 35°C can be calculated as k_i × q_i. 
If your experiment does not have treatments at 
different temperatures, just ignore those 
temperature sensitivity parameters. In that 
case, the model has five parameters to be 
estimated. To conduct data assimilation, the 
prior probability density functions (PDFs) of 
the parameters are needed, which represent the 
prior knowledge ahead of data assimilation. 
When there is not much prior knowledge about 
the parameter distributions, the prior PDFs can 
be specified as uniform distributions over 
parameter ranges, for example based on 
available literature (Table 23.2). 
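
A brief sketch of how the eight parameters and their uniform priors might be handled in code is given here. The bounds are placeholders chosen for illustration and should be replaced by the ranges of Table 23.2; the function names and the parameter ordering are assumptions, not part of the published study.

```python
import numpy as np

# Assumed parameter order: f1, f2, k1, k2, k3, q1, q2, q3
# Placeholder bounds for illustration only; use the ranges of Table 23.2 in practice
lower = np.array([0.01, 0.1, 1e-3, 1e-5, 1e-7, 1.0, 1.0, 1.0])
upper = np.array([0.10, 0.6, 2e-2, 5e-4, 5e-6, 3.0, 4.0, 5.0])

def sample_prior(rng):
    """Draw one parameter set from independent uniform priors."""
    return rng.uniform(lower, upper)

def unpack(params):
    """Derive pool fractions and turnover rates at 25 and 35 degrees C."""
    f1, f2 = params[0], params[1]
    k25 = params[2:5]
    q = params[5:8]
    fractions = np.array([f1, f2, 1.0 - f1 - f2])  # f3 is not a free parameter
    k35 = k25 * q  # a 10-degree warming multiplies each rate by its q
    return fractions, k25, k35

rng = np.random.default_rng(0)
fractions, k25, k35 = unpack(sample_prior(rng))
print(fractions.sum())
```

For an experiment with a single incubation temperature, the last three entries would simply be dropped, leaving five parameters.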


TABLE 23.2 
A description of model parameters and the ranges of their prior uniform 
distributions of the three-pool model used in the chapter 

                                                      Prior uniform distribution 
Parameter  Unit      Description                                    Minimum         Maximum 
f_1        Unitless  Initial fraction of the active pool            1.00 × 10^-2    1.00 × 10^-1 
f_2        Unitless  Initial fraction of the slow pool              1.00 × 10^-1    6.00 × 10^-1 
k_1        d^-1      Turnover rate of the active pool               ?.00 × 10^-3    2.00 × 10^-2 
k_2        d^-1      Turnover rate of the slow pool                 ?.00 × 10^-?    5.0 × 10^-? 
k_3        d^-1      Turnover rate of the passive pool              ?.00 × 10^-?    5.0 × 10^-? 
q_1        Unitless  Temperature sensitivity of the active pool     1.00            3.00 
q_2        Unitless  Temperature sensitivity of the slow pool       1.00            4.00 
q_3        Unitless  Temperature sensitivity of the passive pool    1.00            5.00 


Step 4: cost function. Before the cost function is 
evaluated, a mapping function is needed to relate the 
simulation results to their corresponding measurements. 
Looking at Table 23.1, the 15th measurement was taken 
on day 21. However, the 15th model simulation result is 
the CO2 emission on day 15, and the model simulation for 
day 21 is the 21st result. Our mapping function gathers 
the subset of model results that correspond to each 
measurement occasion. We can build a function between 
the incubation time, t, and the rank of measurements, i: 
t = f(i), using the first column of Table 23.1. The 
incubation time corresponding to a specific measurement 
can be derived via the mapping function. For example, 
the incubation time of the 36th measurement is day 588 
(588 = f(36)). With this mapping function (i.e., 
t = f(i)), you can easily locate the model result 
corresponding to the 36th measurement, which is on day 
f(36) (i.e., day 588). 

After deriving the subset of model results that 
correspond with observations, the cost function, J, 
which measures the magnitude of mismatch between the 
simulation results and observations, can be specified. 
We will adopt the formula for J introduced in Chapter 22. 
Figure 23.3 shows a snapshot of the J values from 
iteration 101 to 200 when running data assimilation 
using the three-pool model and data in Table 23.1. The 
cost function is used in the optimization step to 
determine whether to accept or reject proposed parameter 
sets, which is further discussed in the next step. 

Figure 23.3. A snapshot of the cost function values from 
iteration 101 to 200. 


JUNYI LIANG AND JIANG JIANG 


Step 5: optimization. Let us follow the introduction to 
the Markov Chain Monte Carlo (MCMC) method in Chapter 22 
to conduct the optimization. The value of the cost 
function in iteration m, J_m, is compared with that in 
iteration m−1, J_{m-1}. There are basically two cases 
when comparing the values of the cost function, 
J_m < J_{m-1} and J_m > J_{m-1}. The examples below 
illustrate the two cases. 

Case 1: In Figure 23.3, J_m = 21.3 and J_{m-1} = 221.2 
for one pair of successive iterations. 21.3 is smaller 
than 221.2, meaning the mismatch between model results 
and measurements is reduced, i.e., the model simulation 
is improved. In this case, the proposed parameters are 
accepted. 

Case 2: J_173 = 114.1 and J_172 = 69.2, respectively 
(Figure 23.3). 114.1 is bigger than 69.2, meaning the 
mismatch between model results and measurements is 
increased, i.e., the model simulation is worsened. In 
this case, the exponential of (J_172 − J_173) is 
calculated. Here the value is 3 × 10^-20. The value is 
then compared with a random number between 0 and 1. If 
the randomly selected number is smaller than 
3 × 10^-20, the proposed parameters are accepted. 
Otherwise, the proposed parameters are rejected. Given 
that the number to be compared is a random value between 
0 and 1, the chance of accepting the proposed parameters 
is (3 × 10^-18)%. Let us take a look at another example, 
J_159 = 31.9 and J_158 = 29.6, respectively 
(Figure 23.3). 31.9 is slightly bigger than 29.6. 
Therefore, we calculate the exponential of 
(J_158 − J_159), which is 1 × 10^-1, suggesting there is 
a chance of about 10% to accept the proposed parameters. 


Recall that the purpose of data assimilation is 
to derive globally optimal parameter values. 
Allowing a chance to accept proposed parameters 
even when the mismatch between simulations and 
observations increases prevents the parameter 
estimates from getting trapped in local optima. 
But the chance of accepting proposed parameters 
decreases steeply with the increase in 
simulation-observation mismatch (Figure 23.4). 
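
Steps 4 and 5 can be sketched together in a few lines. The sketch assumes a cost function of the form J = Σ (simulation − observation)² / (2 SD²), which may differ in detail from the formula in Chapter 22, and all function names are hypothetical. The measurement days are those of the first column of Table 23.1.

```python
import numpy as np

# Measurement days from the first column of Table 23.1, i.e. t = f(i)
obs_days = np.array(list(range(1, 15)) + [21] + list(range(28, 589, 28)))

def map_to_observations(daily_sim):
    """Pick the daily model results (day 1 at index 0) that match measurement days."""
    return np.asarray(daily_sim)[obs_days - 1]

def cost(sim, obs, sd):
    """Assumed cost: squared mismatch weighted by the measurement variance."""
    return float(np.sum((sim - obs) ** 2 / (2.0 * sd ** 2)))

def acceptance_probability(j_new, j_old):
    """Chance of accepting a proposal under the Metropolis rule."""
    return float(np.exp(min(0.0, j_old - j_new)))

def accept(j_new, j_old, rng):
    """Accept if a uniform random number falls below the acceptance probability."""
    return rng.uniform() < acceptance_probability(j_new, j_old)

print(len(obs_days), obs_days[14], obs_days[35])  # count, day of 15th and of 36th measurement
print(acceptance_probability(21.3, 221.2))        # Case 1: mismatch reduced
print(acceptance_probability(114.1, 69.2))        # Case 2: mismatch increased
```

Capping the exponent at zero before calling the exponential gives the same probability as min(1, exp(J_old − J_new)) while avoiding overflow when the mismatch drops sharply.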


Figure 23.4. Dependence of parameter acceptance rate 
on the change of cost function (J_m − J_{m-1}). If the mismatch 
between simulations and observations decreases (J_m − J_{m-1} ≤ 0), 
the acceptance rate is 100%. If J_m − J_{m-1} > 0, there is a chance to 
accept proposed parameters, and the acceptance rate decreases 
steeply as the mismatch increases. (x-axis: J_m − J_{m-1}; y-axis: 
acceptance rate, %.) 


Step 6: estimating parameters. Now we have all the 
accepted parameters after finishing 100,000 
MCMC iterations. Excluding results in the 
burn-in period (Chapter 22), we have posterior 
distributions for the selected parameters 
(Figure 23.5). In the figure, the distributions 
of parameters related to the active and slow 
pools have distinct single peaks, which 
means these parameters are well constrained. 
The distributions of parameters related to 
the passive pool, by contrast, are not as clearly 
peaked, suggesting the data may not have enough 
information to constrain them. 


From the posterior distributions, we can derive the 
maximum likelihood estimates (MLEs) of parameters. 
The MLE of a parameter is the value at which 
its distribution peaks. In the example, the Q10 
distributions peak at 1.22 and 1.76 for the active and 
slow pools. We can use these values to represent the 
intrinsic temperature sensitivities at 25°C of the 
corresponding pools. The Q10 distribution of the 
passive pool is not well constrained, and we can use the 
mean value, 2.67, as a reasonable guess of its 
intrinsic temperature sensitivity at 25°C. The temperature 
sensitivity increases as substrate lability decreases, 
suggesting that carbon pools with longer 
turnover times are more vulnerable to warming. 
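
One simple way to read a peak and a mean off the accepted samples is sketched below. Locating the peak with a histogram, the number of bins and the burn-in length are all assumptions made for illustration, and the chain is synthetic rather than output from the study.

```python
import numpy as np

def posterior_summary(samples, burn_in, bins=50):
    """Return (histogram peak, mean) of accepted samples after discarding burn-in."""
    kept = np.asarray(samples, dtype=float)[burn_in:]
    counts, edges = np.histogram(kept, bins=bins)
    peak_bin = np.argmax(counts)
    # Use the centre of the most populated bin as the peak of the distribution
    peak = 0.5 * (edges[peak_bin] + edges[peak_bin + 1])
    return peak, kept.mean()

# Synthetic stand-in for the accepted values of one parameter (not real results)
rng = np.random.default_rng(1)
chain = rng.normal(loc=1.76, scale=0.1, size=20000)
peak, mean = posterior_summary(chain, burn_in=2000)
print(round(peak, 2), round(mean, 2))
```

For a well-constrained parameter the peak and the mean are close to each other; for a flat posterior the peak depends strongly on the binning, which is one reason to fall back on the mean.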


Step 7: generating predictions. Predictions of soil 
carbon decomposition can be generated 
with the derived parameters and the model. 
With the generated predictions, we can 
evaluate the model performance by com- 
paring observations and simulations. In 
Figure 23.6, if all dots are right on the 1:1 
line (i.e., y = x), it means the model simu- 
lations are exactly equal to observations, 
which of course is very unlikely. In prac- 
tice, we can use a regression line to describe 
the comparison of model simulations and 
observations. The regression lines at 25°C 
and 35°C are y = 0.9919x and y = 0.9998x, 
or very close to y = x which would repre- 
sent perfect agreement. This indicates that 
the constrained model fits the data very well. 
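
The regression lines quoted above have no intercept, so a least-squares fit through the origin is assumed in the following sketch. The function name is hypothetical and the modeled values are made up for illustration; only the observed values are taken from Table 23.1.

```python
import numpy as np

def slope_through_origin(observed, modeled):
    """Least-squares slope b of modeled = b * observed (no intercept)."""
    x = np.asarray(observed, dtype=float)
    y = np.asarray(modeled, dtype=float)
    return float(np.sum(x * y) / np.sum(x * x))

# First five 25 degree C means from Table 23.1, with made-up model values
observed = [0.025, 0.048, 0.072, 0.099, 0.121]
modeled = [0.024, 0.049, 0.073, 0.097, 0.120]  # hypothetical simulations
print(slope_through_origin(observed, modeled))
```

A slope close to 1 indicates that simulations and observations agree on average, although it does not by itself show how tightly the points scatter around the line.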


With the constrained model, we can also simulate 
the dynamics of different carbon pools during the 
incubation. Figure 23.7 shows that the active pool 
dominates the CO2 emission at the early stage and 
is depleted very fast. At the end of incubation, the 
cumulative CO2 emission from the active pool is 
similar between the 25°C and 35°C incubations. 
The contributions of the slow and passive pools to 
CO2 emissions gradually increase after the active 
pool is depleted. 

Figure 23.5. Posterior probability density functions of the parameters in the three-pool model. Modified with permission from 
Soil Biology and Biochemistry: Liang et al. (2015). 

Figure 23.6. Comparison of observed and modeled cumulative CO2 emissions (R) in the incubation experiment. All the dots 
are distributed around the 1:1 line, suggesting the model simulations match the observations well. 

Figure 23.7. Observed and modeled cumulative CO2 emissions (R) from individual and total pools at two incubation 
temperatures. Modified with permission from Soil Biology and Biochemistry: Liang et al. (2015). 


SUMMARY 


This chapter introduced the application of data 
assimilation to soil incubation experiments. A 
variety of models can be chosen to simulate soil 
carbon cycling depending on the scientific ques- 
tion and the length of incubation. Assimilating soil 
incubation data with models can help understand 
processes underlying the observed soil carbon 
dynamics, such as turnover rates and temperature 
sensitivities of substrates with different qualities. 


QUIZZES 


1. How can soil carbon models and data assimi- 
lation help with the analysis of soil incubation 
data? 

2. Why is the length of incubation experiments 
an important consideration when choosing a 
model? 

3. Suggest why some parameters can be well constrained 
and others cannot, and how this information 
guides future experimental design. 


This chapter will guide you to learn the seven-step 
procedure of data assimilation by replicating a pub- 
lished study on the assimilation of observational 
data into the TECO model to calibrate decomposi- 
tion rate parameters for multiple soil organic mat- 
ter pools. We will first review the seven steps in 
Chapter 21 and learn how to program these steps 
using code examples in Python. Then we will per- 
form three exercises to reproduce figures from an 
earlier paper on the study with the CarboTrain tool- 
kit. It is recommended that you go over the source 
code of the three exercises in CarboTrain. You are 
expected to program a data assimilation algorithm 
with your own model or data sets after this practice. 


INTRODUCTION 


The seven steps in data assimilation (DA) are 
described in detail in Chapter 21. They are: 
(1) defining an objective; (2) preparing data; 
(3) choosing a model; (4) using a cost function; 
(5) applying an optimization method; (6) estimating 
parameters; and (7) generating predictions 
(Figure 24.1). This practice will use an example 
from a study by Xu et al. (2006) to introduce how 
to program these seven steps. The example of the 
DA study by Xu et al. (2006) is programmed in 
a Python file, ‘Probabilistic_inversion.py’. 
We will use this Python program to illustrate 
each of the seven steps in a DA study. 


STEP 1: DEFINING AN OBJECTIVE 


Defining an objective usually means deciding which target 
parameters are to be estimated by DA. The target parameters 
chosen in the study by Xu et al. (2006) are 
decomposition rates, which represent the fractions of carbon 
(C) leaving seven soil organic matter (SOM) pools 
of the Terrestrial ECOsystem (TECO) model (Table 
24.1). The TECO model will be described in Step 3. 

Figure 24.1. Seven-step procedure of the data assimilation (DA) 
study by Xu et al. (2006). 


STEP 2: PREPARING DATA 


The DA study in Xu et al. (2006) integrated six 
observations (i.e., soil respiration, woody biomass, 
foliage biomass, litterfall, C in forest floor, 
and C in forest mineral soil) under ambient and 
elevated CO2 treatments. All 12 data sets are 
saved in six data files under the ‘/input’ folder. 
There are three rows in each data file. The first row 
is the time series, the second is the observation under 
ambient CO2 treatment and the final row is the 
observation under elevated CO2 treatment. Figure 24.2 
shows the Python code for reading the six data files 
and calculating the six observation variances. The 
variable ninput, with a value of 1 or 2, indicates 
which CO2 treatment is applied in the Python program. 
For example, if ninput has a value of 1, the 
variables soilResp, woody, foliage, litterfall, forestFloor, and 
forestMineral save the six observations from the 
ambient CO2 treatment. All six observations are 
collected in the obsList variable and the corresponding 
six observation variances are saved in the varList 
variable. These variables will be used in Step 4. 

TABLE 24.1 

Parameter(a)  Description                   Unit            [illegible]  [illegible] 
c_1           Fraction of C leaving pool 1  g C g^-1 d^-1   1.764        2.95 
c_2           Fraction of C leaving pool 2  g C g^-1 d^-1   0.548        0.274 
c_3           Fraction of C leaving pool 3  g C g^-1 d^-1   5479         27.34 
c_4           Fraction of C leaving pool 4  g C g^-1 d^-1   ?            ? 
c_5           Fraction of C leaving pool 5  g C g^-1 d^-1   27.4         6.85 
c_6           Fraction of C leaving pool 6  g C g^-1 d^-1   ?            -0274 
c_7           Fraction of C leaving pool 7  g C g^-1 d^-1   0.0137       0.00913 

(a) Based on the study by Xu et al. (2006). 

PRACTICE 6 


STEP 3: MODEL 


A terrestrial ecosystem (TECO) model is used (Xu 
et al., 2006). The TECO model has a seven-pool structure 
and the fractions of C exiting each pool each 
day are the parameters to be estimated (Table 24.1, 
the variable c in Figure 24.3a). This step involves two 


    import numpy as np

    ## Step 2: 6 Data Sets and their variances 
    ## initialize observation-related variables 
    soilResp=np.loadtxt('./Source_code/unit_6/input/SoilRespiration.txt')[[0,ninput],] 
    woody=np.loadtxt('./Source_code/unit_6/input/Woody.txt')[[0,ninput],] 
    foliage=np.loadtxt('./Source_code/unit_6/input/Foliage.txt')[[0,ninput],] 
    litterfall=np.loadtxt('./Source_code/unit_6/input/Litterfall.txt')[[0,ninput],] 
    forestFloor=np.loadtxt('./Source_code/unit_6/input/ForestFloor.txt')[[0,ninput],] 
    forestMineral=np.loadtxt('./Source_code/unit_6/input/ForestMineral.txt')[[0,ninput],] 
    ## collect the 6 data sets into an observation list 
    obsList=[woody[1,],foliage[1,],litterfall[1,],forestFloor[1,],forestMineral[1,],soilResp[1,]] 
    ## variance of 6 observations 
    ## ddof=1 gives the unbiased estimator of the sample variance 
    varList=[np.var(obs, ddof=1) for obs in obsList] 


Figure 24.2. The Python code for reading six observations from text files and calculating their variances. 


(a) 

50  ## Step 3: TECO model 
51  def run_model(c): 
52      """Forward run model. 
53      Global args: tau, b, u, x0, Nt, cbnScale""" 
54      ## x stores 7 simulated carbon pools simulated in 5 years (daily scale) 
55      x=np.zeros([7,Nt+1], dtype=float) 
56      ## product of matrix A and diagonal matrix C 
57      AC = np.matrix([[-c[0], 0, 0, 0, 0, 0, 0], 
58                      [0, -c[1], 0, 0, 0, 0, 0], 
59                      [0.7123 * c[0], 0, -c[2], 0, 0, 0, 0], 
60                      [0.2877 * c[0], c[1], 0, -c[3], 0, 0, 0], 
61                      [0, 0, 0.45 * c[2], 0.275 * c[3], -c[4], 0.42 * c[5], 0.45 * c[6]], 
62                      [0, 0, 0, 0.275 * c[3], 0.296 * c[4], -c[5], 0], 
63                      [0, 0, 0, 0, 0.004 * c[4], 0.03 * c[5], -c[6]]]) 
64      x[:,0] = x0 
65      ## simulate for 5 years 
66      for i in range(1,Nt+1): 
67          ## Eq. 1 in Xu et al., (2006) 
68          x[:,i]=np.dot((np.eye(7)+AC*tau[i-1]),x[:,i-1])+np.asarray(b)*u[i-1]*cbnScale 
69      ## simulated carbon pools with 7 rows and Nt columns 
70      xsimu=x[:,range(1,Nt+1)] 
90      return [woody_simu, foliage_simu, litterYr, forestFloor_simu, forestMineral_simu, 
                soilResp_simu] 

(b) 

249  ## Step 3: TECO model 
250  ## running model, return [woody_simu, foliage_simu, litterYr, forestFloor_simu, 
     ## forestMineral_simu, soilResp_simu] 
251  simuList = run_model(c_new) 


Figure 24.3. The Python code for (a) defining a function of TECO model simulation; (b) calling the function to run the model. 


program sections in Probabilistic_inversion.py: 
defining the model simulation, and running the model 
simulation (Figure 24.3). A function 
run_model (Figure 24.3a) defines the TECO model 
simulation with a set of parameter values (variable 
c). The core code of the model simulation (lines 
65-68) implements the differential equation: 

$$\frac{dX(t)}{dt} = \xi(t) A C X(t) + B U(t) \qquad (24.1)$$

where X(t) is a 7 × 1 vector representing the 
seven C pool sizes at time t; ξ(t) is a scaling function 
describing environmental effects at time t; A is 
a 7 × 7 matrix representing the C transfer coefficients 
among the seven pools; C is a 7 × 7 diagonal 
matrix with diagonal elements describing the 
fraction of C leaving each pool, i.e., the parameters 
to be estimated; B = (0.25, 0.3, 0, 0, 0, 0, 0)^T is a 
vector of the partitioning coefficients 
of C input to non-woody and woody biomass; 
U(t) is the C input from photosynthesis at time t. 

The variable Nt (Figure 24.3a) is the number of daily 
time steps in the five-year simulation, with a value of 1825. 
The variable xsimu stores the seven C pool sizes 
over the Nt simulation days. Rather than the seven C 
pool sizes, the return values of the run_model function 
are six simulated data sets (i.e., woody_simu, foliage_simu, 
litterYr, forestFloor_simu, forestMineral_simu, and 
soilResp_simu) to match the six observational data sets. 
Mapping operators are used for this purpose and we 
will learn about these operators in Step 4. 

To run the model simulation, a new set of 
parameter values (variable c_new) needs to be 
passed to the run_model function (Figure 24.3b). 
During this process, the parameter values in variable 
c_new are assigned to variable c in Figure 
24.3a. Variable simuList saves the return values of 
the model simulation (i.e., the six simulated data 
sets). In Step 4, the cost function will calculate the 
mismatch between the simulated data sets (simuList) 
and the observed ones (obsList from Step 2). 


STEP 4: COST FUNCTION 


The cost function quantifies the discrepancy 
between simulated data sets (i.e., simuList) and 
observed ones (i.e., obsList). In the TECO model 
simulation, function run_model calculates the seven 
C pool sizes over the course of five years, with 
results saved in variable xsimu. To be comparable 
with the six observed data sets (obsList), xsimu needs to 
be converted to six simulated data sets (simuList) 
with mapping operators (i.e., phi_litterfall, phi_slResp, 
phi_woodBiom, phi_foilageBiom, phi_cForestFloor, 
phi_cMineral in Figures 24.4 and 24.5). Before mapping, 
the values of the phi_slResp and phi_litterfall mapping 
operators need to be updated with the new 
parameter values (Figure 24.5b). 


    ## Step 3: TECO model 
    def run_model(c): 
        ## (lines 52-68 as in Figure 24.3a) 
        ## simulated carbon pools with 7 rows and Nt columns 
        xsimu=x[:,range(1,Nt+1)] 
        ## use mapping function to convert 7 carbon pools to 6 simulated data sets 
        ## get yearly simulated litterfall 
        litterDaily=tau[:-1] * np.dot(phi_litterfall, xsimu) 
        litterYr = [sum(litterDaily[range((i-1)*365, i*365)]) for i in range(1,6)] 
        ## get simulated soilResp 
        # convert float to int. In python, index starts from 0. 
        soilTime=soilResp[0,:].astype(int)-1 
        soilResp_simu=np.asarray(tau)[soilTime]*np.dot(phi_slResp,xsimu[:,(soilTime)])+0.25*(1-b[0]-b[1])*u[soilTime] 
        ## get simulated woody biomass 
        woodyTime=woody[0,:].astype(int)-1 
        woody_simu = np.dot(phi_woodBiom, xsimu[:, woodyTime]) 
        ## get simulated foliage biomass 
        foliageTime=foliage[0,:].astype(int)-1 
        foliage_simu=np.dot(phi_foilageBiom, xsimu[:, foliageTime]) 
        ## get simulated forest floor C 
        forestFloorTime = forestFloor[0, :].astype(int) - 1 
        forestFloor_simu = np.dot(phi_cForestFloor, xsimu[:, forestFloorTime]) 
        ## get simulated forest mineral soil C 
        cMineralTime = forestMineral[0, :].astype(int) - 1 
        forestMineral_simu = np.dot(phi_cMineral, xsimu[:, cMineralTime]) 
        return [woody_simu, foliage_simu, litterYr, forestFloor_simu, forestMineral_simu, 
                soilResp_simu] 


Figure 24.4. Python code for mapping seven pool sizes (xsimu) to six simulated data sets (woody_simu, foliage_simu, litterYr, 
forestFloor_simu, forestMineral_simu, and soilResp_simu) from the TECO model simulation. 


(a) 

    ## The mappings 
    phi_woodBiom=[0,1,0,0,0,0,0] 
    phi_foilageBiom=[0.75,0,0,0,0,0,0] 
    phi_cMineral=[0,0,0,0,1,1,1] 
    phi_cForestFloor=[0,0,0.75,0.75,0,0,0] 
    phi_slResp=[0.25*c[0],0.25*c[1],0.55*c[2],0.45*c[3],0.7*c[4],0.55*c[5],0.55*c[6]] 
    phi_litterfall=[0.75*c[0],0.75*c[1],0,0,0,0,0] 

(b) 

    ## only update 2 mappings when new parameter values are generated 
    phi_slResp=[0.25*c_new[0],0.25*c_new[1],0.55*c_new[2],0.45*c_new[3],0.7*c_new[4],0.55*c_new[5],0.55*c_new[6]] 
    phi_litterfall=[0.75*c_new[0],0.75*c_new[1],0,0,0,0,0] 


Figure 24.5. The Python code for (a) defining the mapping operators; (b) updating mapping operators according to parameter 
values (c_new). 


After mapping, the mismatch between simuList and obsList 
is calculated according to: 

$$P(Z \mid c) \propto \exp\left\{ -\sum_{i=1}^{6} \sum_{t \in \mathrm{obs}(Z_i)} \frac{\left[ Z_i(t) - \varphi_i X(t) \right]^2}{2\sigma_i^2} \right\} \qquad (24.2)$$

where P(Z|c) is the conditional probability density 
of observations Z on parameters c, i.e., the 
likelihood function of parameters c; σ_i^2 is the ith 
observation variance; obs(Z_i) is the time series of the 
ith observation; Z_i(t) is the ith observation at time 
t; φ_i is the mapping operator; X(t) is the simulated 
seven pool sizes at time t; φ_i X(t) is the ith simulated 
data set after mapping. 


    ## Step 3: TECO model 
    ## running model, return [woody_simu, foliage_simu, litterYr, forestFloor_simu, 
    ## forestMineral_simu, soilResp_simu] 
    simuList = run_model(c_new) 

    ## Step 4: Cost Function 
    J_new=sum([sum((simuList[i]-obsList[i])**2)/(2*varList[i]) for i in range(6)]) 


Figure 24.6. Python code for calculating the mismatch between simulated (simuList) and observed datasets (obsList) based on 
the cost function. 


The value of the mismatch, i.e., the double sum in the 
exponent, is saved in the variable 
J_new (Figure 24.6). Step 5 (Optimization) will 
use the value of J_new to decide whether the set of 
parameter values should be accepted or not. 
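
The mapping and the cost function can be tried on a toy system before working with the full TECO code. The sketch below uses three pools and two observation types with made-up numbers; the function name and all values are hypothetical, and only the form of the mismatch follows the J_new expression in Figure 24.6.

```python
import numpy as np

def cost_function(x_sim, operators, obs_times, obs_values, obs_vars):
    """J = sum_i sum_t [Z_i(t) - phi_i X(t)]^2 / (2 sigma_i^2)."""
    J = 0.0
    for phi, times, z, var in zip(operators, obs_times, obs_values, obs_vars):
        simulated = np.dot(phi, x_sim[:, times])  # map pools to the observed quantity
        J += np.sum((simulated - z) ** 2) / (2.0 * var)
    return float(J)

# Toy example: 3 pools simulated for 5 days (columns are days)
x_sim = np.array([[10.0, 9.0, 8.0, 7.0, 6.0],
                  [20.0, 20.5, 21.0, 21.5, 22.0],
                  [5.0, 5.0, 5.0, 5.0, 5.0]])
operators = [np.array([1.0, 0.0, 0.0]),   # first data set observes pool 1
             np.array([0.0, 1.0, 1.0])]   # second data set observes pools 2 + 3
obs_times = [np.array([0, 2, 4]), np.array([1, 3])]  # zero-based day indices
obs_values = [np.array([10.0, 8.0, 7.0]), np.array([25.0, 27.5])]
obs_vars = [0.5, 2.0]
print(cost_function(x_sim, operators, obs_times, obs_values, obs_vars))
```

Dividing each squared mismatch by its own variance lets data sets with different units and different measurement precision contribute to one scalar J.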


STEP 5: OPTIMIZATION METHOD 


Our study uses the Metropolis-Hastings method to 
draw parameter samples from their posterior 
distribution. This method iteratively executes two phases 
(i.e., a proposing phase and a moving phase) until a 
preset iteration number (e.g., 20,000) is reached. 
The proposing phase is implemented by function 
GenerateParamValues, which generates new parameter 
values (i.e., variable cNew) based on the currently 
accepted values (i.e., variable c_op) (Figure 24.7a). 
During this process, function GenerateParamValues 
calls another function, isQualified, to assess whether 
the newly proposed parameter values (cNew) are 
within the prescribed parameter range [cmin, cmax]. If the 
isQualified function returns True, cNew is a reasonable 
new set of parameter values and the generation 
of new parameter values stops. Otherwise, 
function GenerateParamValues discards this set of 
parameter values, generates new parameter values, 
assigns these values to cNew, and calls function 
isQualified again. Following the proposing phase, the new 
parameter values (stored in c_new by the calling code) 
are used to update the mapping operators (i.e., 
phi_slResp and phi_litterfall) and to run the model 
simulation by calling function run_model (Figure 
24.7b). The mathematical mechanism behind the 
proposing phase is available in Xu et al. (2006). 


(a) 

    def GenerateParamValues(c_op): 
        """Generate new parameter values based on eigenvalues and eigenvectors 
        Global args: eigD, eigV, cmin, cmax, paramNum""" 
        flag = True 
        while (flag): 
            # Normally distributed pseudorandom numbers 
            randVector = np.random.randn(paramNum) 
            cT = randVector * np.sqrt(eigD) 
            cNew = np.dot(eigV, (np.dot(eigV.T, c_op) + cT)) 
            if (isQualified(cNew)): 
                flag = False 
        return cNew 

    def isQualified(c): 
        """Decide whether the new parameter values lie within the [cmin, cmax] interval 
        Global args: paramNum, cmin, cmax""" 
        flag = True 
        for i in range(paramNum): 
            if(c[i] > cmax[i] or c[i] < cmin[i]): 
                flag = False 
                break 
        return flag 

(b) 

    ## Proposing step: generate a new set of parameter values based on current accepted 
    ## parameter values 
    c_new = GenerateParamValues(c_record[:, record]) 
    ## only update 2 mappings when new parameter values are generated 
    phi_slResp=[0.25*c_new[0],0.25*c_new[1],0.55*c_new[2],0.45*c_new[3],0.7*c_new[4],0.55*c_new[5],0.55*c_new[6]] 
    phi_litterfall=[0.75*c_new[0],0.75*c_new[1],0,0,0,0,0] 
    ## Step 3: TECO model 
    ## running model, return [woody_simu, foliage_simu, litterYr, forestFloor_simu, 
    ## forestMineral_simu, soilResp_simu] 
    simuList = run_model(c_new) 


Figure 24.7. Python code for (a) defining the function of the proposing phase in the Metropolis-Hastings algorithm; (b) calling the 
function to generate parameter samples. 


The moving phase decides whether the new 
parameter values (i.e., c_new) are accepted or not 
(Figure 24.8). 

The value of J_new (i.e., the mismatch between 
the data sets simulated with c_new and the observed 
data sets) is compared with J_record[record] (i.e., the 
mismatch using the previously accepted parameter values 
c_record[:, record]). The new values are accepted if J_new 
is smaller than J_record[record], or if the value 
exp(J_record[record] − J_new) is larger than a random 
number (randNum) drawn from the uniform distribution 
between 0 and 1. If accepted, the count of accepted 
parameter values, record, is increased by 1, the new 
parameter values are saved into an array, c_record, and the 
corresponding mismatch is saved in another 
array, J_record. Therefore, 
c_record[:, record] and J_record[record] indicate the 
current last element in the two arrays, which also 
represent the currently accepted parameter values 
and corresponding mismatch. If the new parameter 
values from this iteration are not accepted, they 
are discarded. Whether c_new is accepted or not, the 
next iteration always uses the currently accepted 
parameter values (c_record[:, record]) for the proposing 
phase to generate a new set of parameter values. 


    ## Step 4: Cost Function 
    J_new=sum([sum((simuList[i] - obsList[i])**2) / (2 * varList[i]) for i in range(6)]) 
    delta_J = J_record[record] - J_new 
    ## Moving step: to decide whether the new set of parameter values will be accepted or not 
    randNum = np.random.uniform(0, 1, 1) 
    if (min(1.0, np.exp(delta_J)) > randNum): 
        ## accept the new set of parameter values and update relevant variables 
        record += 1 
        c_record[:, record] = c_new 
        J_record[record] = J_new 
        ## print out the iteration number and the number of accepted parameter sets 
        print('simu=' + str(simu) + ' accepted=' + str(record)) 


Figure 24.8. The Python code of the moving phase in the Metropolis-Hastings algorithm. 
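
The two phases can be put together in a self-contained toy example. This is a simplified sketch, not the code of Probabilistic_inversion.py: it assumes a plain Gaussian random-walk proposal instead of the eigenvector-based proposal of Figure 24.7a, and the function name, step size and toy cost function are all hypothetical. As in the chapter's code, only accepted parameter sets are recorded.

```python
import numpy as np

def metropolis(cost, c0, cmin, cmax, step, n_iter, seed=0):
    """Sample parameters with a Gaussian random-walk proposal and the Metropolis rule."""
    rng = np.random.default_rng(seed)
    c_record = [np.asarray(c0, dtype=float)]
    J_record = [cost(c_record[0])]
    for _ in range(n_iter):
        # Proposing phase: redraw until the proposal lies inside [cmin, cmax]
        while True:
            c_new = c_record[-1] + step * rng.standard_normal(len(c0))
            if np.all(c_new >= cmin) and np.all(c_new <= cmax):
                break
        # Moving phase: accept with probability min(1, exp(J_old - J_new))
        J_new = cost(c_new)
        delta = min(0.0, J_record[-1] - J_new)  # capped at 0 to avoid overflow
        if np.exp(delta) > rng.uniform():
            c_record.append(c_new)
            J_record.append(J_new)
    return np.array(c_record), np.array(J_record)

# Toy problem: recover a decay rate of 0.3 from noise-free synthetic data
t = np.arange(1.0, 11.0)
obs = np.exp(-0.3 * t)
cost = lambda c: float(np.sum((np.exp(-c[0] * t) - obs) ** 2) / (2 * 0.01 ** 2))
samples, J = metropolis(cost, c0=[0.8], cmin=np.array([0.0]), cmax=np.array([1.0]),
                        step=0.05, n_iter=5000)
print(len(samples), samples[len(samples) // 2:].mean())
```

The early accepted values form the burn-in as the chain moves from the starting value toward the region of low mismatch, so only the later part of the record is summarized.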


STEP 6: ESTIMATEPARAMETERS 


The outputs of DA are the accepted parameter values
(c_record), the corresponding mismatches (J_record), the
number of accepted parameter values (record), the
optimal parameter values obtained through maximum
likelihood estimation (bestC) and the simulated data sets
given the optimal parameters (bestSimu). All these
outputs are saved to text files by calling the function
write_io_file (Figure 24.9). All accepted parameter
values are saved in 'param_accepted.txt', the
corresponding mismatches in 'mismatch_accepted.txt',
the number of accepted parameter values in
'accepted_num.txt', the optimal parameter values in
'bestParam.txt', and the simulation outputs given the
optimal parameter values in 'xxx_bestSimu.txt'
(e.g., 'Woody_bestSimu.txt') in the output
directory (outDir).

Posterior distributions describe the parameter
ranges that remain after DA has constrained them,
and they are often used to estimate parameter
uncertainty. An R script is provided to plot the
posterior distributions of all accepted parameter
values in 'param_accepted.txt'. The peak of each
distribution represents the optimal parameter
value. The R script also compares the data sets
simulated with these optimal parameter values
against the observations. After DA,
'Probabilistic_inversion.py' automatically runs
the R script and saves the plots into the
'/figures' folder.
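
The two ways of reading an optimal value from the DA outputs (the accepted set with the smallest mismatch, and the peak of each posterior histogram) can be sketched in Python as follows. This is not the R script shipped with CarboTrain; the function name summarize_posterior, the bin count and the synthetic arrays are assumptions made for illustration. In practice the arrays would be read from 'param_accepted.txt' and 'mismatch_accepted.txt' with np.loadtxt.

```python
import numpy as np


def summarize_posterior(c_accepted, J_accepted, bins=20):
    """c_accepted has shape (n_param, n_accepted); J_accepted has shape (n_accepted,)."""
    # the accepted set with the smallest mismatch has the highest likelihood
    best_idx = int(np.argmin(J_accepted))
    best_c = c_accepted[:, best_idx]
    # peak of each marginal histogram, reported as the centre of the tallest bin
    modes = np.zeros(c_accepted.shape[0])
    for k in range(c_accepted.shape[0]):
        counts, edges = np.histogram(c_accepted[k], bins=bins)
        peak = int(np.argmax(counts))
        modes[k] = 0.5 * (edges[peak] + edges[peak + 1])
    return best_c, modes


# synthetic stand-in for accepted samples: one well-constrained and one flat parameter
rng = np.random.default_rng(1)
c_accepted = np.vstack([rng.normal(2.0, 0.1, 5000), rng.uniform(0.0, 1.0, 5000)])
J_accepted = (c_accepted[0] - 2.0) ** 2 / (2 * 0.1 ** 2)
best_c, modes = summarize_posterior(c_accepted, J_accepted)
print("minimum-mismatch parameters:", best_c)
print("histogram peaks:", modes)
```

For the flat (poorly constrained) parameter the histogram peak is essentially arbitrary, which is one reason to inspect the shape of the whole distribution rather than a single number.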


102 def write_io_file(outDir):
103     """write outputs to files.
104     Global args: c_record, J_record, record, bestC, bestSimu"""
105     np.savetxt(outDir+'/mismatch_accepted.txt', J_record[1:record])
106     np.savetxt(outDir+'/param_accepted.txt', c_record[:, 1:record])
107     np.savetxt(outDir+'/accepted_num.txt', [record])
108     np.savetxt(outDir+'/bestParam.txt', bestC)
109     np.savetxt(outDir+'/Woody_bestSimu.txt',bestSimu[0])
110     np.savetxt(outDir+'/Foliage_bestSimu.txt',bestSimu[1])
111     np.savetxt(outDir+'/Litterfall_bestSimu.txt',bestSimu[2])
112     np.savetxt(outDir+'/Forestfloor_bestSimu.txt',bestSimu[3])
113     np.savetxt(outDir+'/ForestMineral_bestSimu.txt',bestSimu[4])
114     np.savetxt(outDir+'/SoilResp_bestSimu.txt',bestSimu[5])


Figure 24.9. The Python code for saving DA outputs to text files.


STEP 7: PREDICTION


Parameter uncertainty, as expressed by the posterior
distribution of parameters, translates into
prediction uncertainty, which is characterized by the
cumulative probability distribution of simulated
C pool sizes. Prediction in the study of Xu et al.
(2006) uses two functions: prediction and
forward_run (Figure 24.10a). The function prediction
is the starting point of the prediction step. First, this
function draws 12,000 samples (c_all) from
the accepted parameter values saved in the outDir
folder. Then it uses each sample (c) of these
12,000 sets of parameter values for a forward model
simulation by calling the function forward_run.
The variable x_record saves all simulated pool
sizes (Line 128 in Figure 24.10a). Finally, the
results of the prediction (i.e., x_record) are
written to the file 'prediction.txt' in
the outDir folder. To call the function prediction, we
only need to provide the directory of the DA results
(outDir), as shown in Figure 24.10b.

With each set of parameter values (c) sampled
in the function prediction, the function forward_run
executes the model simulation over the 10 years
following the DA period (i.e., 2001 to 2010) using
Equation 24.1. This function is similar to the function
run_model in Step 3, except that the simulation
period is different. The environmental scalar
(tau_forecast) and the C input from photosynthesis
(u_forecast) are built by repeating the 1996 to 2000
series twice to provide environmental 'forcing' for
2001 to 2010. The variable x stores the seven
simulated C pool sizes at each daily time step. The
return value of forward_run is the simulated pool
sizes at the end of 2010 (xsimu). Line 128 in Figure
24.10a shows an example of calling forward_run in the
function prediction.
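
For readers who prefer an equation to code, the daily update applied on line 158 of Figure 24.10a can be transcribed as shown below. This is a reading of the code rather than a quotation of Equation 24.1, so the notation is an assumption, in particular the symbol s for cbnScale and the subscript "forecast" for Nt_forecast.

```latex
X_i = \left( I + \tau_{i-1}\, AC \right) X_{i-1} + b\, u_{i-1}\, s,
\qquad i = 1, \dots, N_{\mathrm{forecast}},
\qquad X_0 = x_0
```

Here X_i is the vector of the seven pool sizes on day i, I is the 7 x 7 identity matrix, AC is the matrix built on lines 147 to 153, tau and u are the forecast series of the environmental scalar and the C input, and b is the vector of the same name in the code.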




(a)

116 def prediction(outDir):
117     nsample = 12000
118     # check whether two files exist
119     if(os.path.isfile(outDir+"/param_accepted.txt") and
           os.path.isfile(outDir+"/accepted_num.txt")):
120         c_record = np.loadtxt(outDir+"/param_accepted.txt")
121         record = int(np.loadtxt(outDir+"/accepted_num.txt"))-1
122         # generate 12,000 random integers ranging from 1 to record
123         sampleId = np.random.randint(1,record,nsample)
124         c_all = c_record[:,sampleId]  # 12,000 samples of p(c|Z)
125         x_record = np.zeros([7,nsample], dtype=float)
126         for i in range(nsample):
127             c = c_all[:,i]
128             x_record[:,i] = forward_run(c)
129             print('the '+str(i+1)+'th sampling, total '+ str(nsample))
130         # save the predicted pool sizes
131         np.savetxt(outDir+'/prediction.txt', x_record)
132     else:
133         # Can't find param_accepted.txt and accepted_num.txt in the output folder
134         print('Warning: Please finish exercise 1 first!')

136 def forward_run(c):
137     """prediction with the parameter values c
138     Global args: tau, b, u, x0, Nt, cbnScale
139     """
140     ## environmental scalar and C input duplicate to represent 2000 to 2010
141     Nt_forecast = 365*10
142     tau_forecast = np.tile(tau,2)  # repeat twice
143     u_forecast = np.tile(u,2)
144     ## x stores 7 simulated carbon pools from 2000 to 2010 (daily scale)
145     x = np.zeros([7,Nt_forecast+1], dtype=float)
146     ## multiply matrix A and diagonal matrix C
147     AC = np.matrix([[-c[0], 0, 0, 0, 0, 0, 0],
148                     [0, -c[1], 0, 0, 0, 0, 0],
149                     [0.7123 * c[0], 0, -c[2], 0, 0, 0, 0],
150                     [0.2877 * c[0], c[1], 0, -c[3], 0, 0, 0],
151                     [0, 0, 0.45 * c[2], 0.275 * c[3], -c[4], 0.42 * c[5], 0.45 * c[6]],
152                     [0, 0, 0, 0.275 * c[3], 0.296 * c[4], -c[5], 0],
153                     [0, 0, 0, 0, 0.004 * c[4], 0.03 * c[5], -c[6]]])
154     x[:,0] = x0
155     ## simulate carbon pools from 2000 to 2010
156     for i in range(1,Nt_forecast+1):
157         ## Eq. 1 in Xu et al., (2006)
158         x[:,i] = np.dot((np.eye(7)+AC*tau_forecast[i-1]),x[:,i-1])+np.asarray(b)*u_forecast[i-1]*cbnScale
159     ## simulated carbon pools at 2010
160     xsimu = x[:,Nt_forecast]
161     return xsimu

(b)

273 if ninput == 1 and sys.argv[2] == "ParamRange.txt" and enable_prediction == 1:
274     prediction(outDir)
275 elif ninput == 2 and sys.argv[2] == "ParamRange.txt" and enable_prediction ==
276     prediction(outDir)


Figure 24.10. Python code for (a) defining functions for prediction; (b) calling functions for prediction.
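
The idea behind the prediction step, namely resampling accepted parameter values, running the model forward for each draw, and summarizing the spread of the results as a cumulative distribution, can be illustrated with a much smaller stand-in. The one-pool model toy_forward_run, the synthetic posterior samples and all numbers below are hypothetical and serve only to show the workflow; they do not reproduce the seven-pool model of Figure 24.10a.

```python
import numpy as np


def toy_forward_run(c, x0=100.0, u=1.0, n_steps=3650):
    """One-pool stand-in for forward_run: x[t+1] = (1 - c) * x[t] + u."""
    x = x0
    for _ in range(n_steps):
        x = (1.0 - c) * x + u
    return x


def predict(c_accepted, n_sample=1000, seed=0):
    rng = np.random.default_rng(seed)
    # draw indices with replacement; the upper bound is exclusive
    sample_id = rng.integers(0, c_accepted.size, n_sample)
    return np.array([toy_forward_run(c) for c in c_accepted[sample_id]])


rng = np.random.default_rng(42)
c_accepted = rng.normal(0.01, 0.001, 2000)        # synthetic posterior samples
x_pred = predict(c_accepted)

# empirical cumulative probability distribution of the predicted pool size
x_sorted = np.sort(x_pred)
cum_prob = np.arange(1, x_sorted.size + 1) / x_sorted.size
low, median, high = np.percentile(x_pred, [2.5, 50, 97.5])
print("median:", median, "95% interval:", (low, high))
```

Note that NumPy's integer samplers exclude the upper bound, so np.random.randint(1, record, nsample) on line 123 of Figure 24.10a draws indices from 1 up to record - 1.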


EXERCISES WITH CARBOTRAIN TOOLBOX 


The following three exercises using the CarboTrain
toolbox will help readers become familiar with the
DA methodology described above. Exercise 1 is to
conduct DA with the TECO model separately for the
ambient and elevated CO₂ treatments. Based on the
results, we will generate figures similar to Figures
3, 4 and 8 in Xu et al. (2006). Exercise 2 uses the
optimised parameter values from Exercise 1 to
predict soil C pool sizes. The expected figure is
similar to Figure 9 in Xu et al. (2006). The posterior
distributions of three of the parameters are
bell-shaped in Figures 3 and 4 of Xu et al. (2006). In
Exercise 3 we will repeat the DA under the ambient
CO₂ treatment using enlarged prior ranges for
three parameters, including c₃ and c₅. Results
similar to Figure 5 in Xu et al. (2006) are expected.
EXERCISE 1


1. Open CarboTrain > A dialog appears
(Figure 24.11) > Select Unit 6 and
Exercise 1 > Choose ambient CO₂ >
Choose an output directory on your
computer > Click 'Run Exercise' > A
dialog appears (Figure 24.12a) > Click
'OK'.

2. A series of outputs will be printed in
another window (Figure 24.13). The
variable simu is the total number of
executed simulations, while the variable
accepted is the number of simulations
with accepted parameter values. These
numbers will keep increasing until simu
reaches approximately 20,000.
A dialog will pop up to notify you that DA
is finished. Execution time is about 20
minutes, depending on the specifications
of your computer.

3. After DA, a dialog appears to notify you that
the task is complete (Figure 24.12b).
Open the output directory you specified
in Step 1. There will be four
folders: '/practice 1_ambient',
'/practice 1_elevated',
'/practice 2' and '/practice 3_ambient'.
Open the '/practice 1_ambient' folder.
You will find text files and figures
generated in the subfolder '/figures'.
These files and figures are results of DA
under the ambient CO₂ treatment.

4. Repeat Steps 1-3 but choose elevated
CO₂ in Step 1. Do not change the output
directory. After DA, open the
'/practice 1_elevated' folder in the
output directory. The files and figures in this
folder are results of DA under the elevated
CO₂ treatment. Compare the results with
Figures 3, 4 and 8 in Xu et al. (2006).


QUESTIONS:


What is the acceptance rate in each exercise?
Which parameters are well constrained? Which
observations contribute to the constrained
parameter values? How many shapes are there
in the posterior distributions (e.g. bell shape)?
What are the meanings behind these different
shapes?


Figure 24.11. The main window in CarboTrain toolbox for this practice.


Figure 24.12. The dialog notifies (a) submitting a new task; (b) finishing a task.


Figure 24.13. Window in CarboTrain toolbox to display the progress of the DA experiment.


EXERCISE 2


1. In the main window of CarboTrain,
select Unit 6 and Exercise 2 > Use the
default output folder as for Exercise 1 >
Click 'Run Exercise'.

2. Open the output folder and find a figure
generated in the '/practice 2'
folder. Compare this figure with Figure 9
in Xu et al. (2006).


QUESTIONS:


Does elevated CO₂ positively influence C accumulation
(i.e., larger pool sizes)? Do poorly constrained
parameters influence the predicted sizes of
their associated pools?


EXERCISE 3


1. Repeat Steps 1-3 of Exercise 1 but choose
ambient CO₂ and Exercise 3 in the
CarboTrain main window (Figure 24.11).

2. After DA, go to the output directory
and open the '/practice 3_ambient'
folder. The files and figures generated
in this folder are results of DA using
enlarged prior ranges for three parameters,
including c₃ and c₅. Compare the results
with Figure 5 in Xu et al. (2006).


QUESTIONS:


Did the posterior distributions of these three
parameters change after their prior parameter
ranges were extended? What else can we do to
further constrain the parameter uncertainty?


SUGGESTED READING


Xu, T., White, L., Hui, D. and Luo, Y., 2006. Probabilistic
inversion of a terrestrial ecosystem model: Analysis
of uncertainty in parameter estimation and model
prediction. Global Biogeochemical Cycles, 20(2).

Huang, X., Lu, D., Ricciuto, D.M., Hanson, P.J.,
Richardson, A.D., Lu, X.H., Weng, E.S., Nie, S., Jiang, L.F.,
Hou, E.Q., Steinmacher, I.F. and Luo, Y.Q., 2021. A
model-independent data assimilation (MIDA) module and its
applications in ecology. Geoscientific Model Development,
14: 5217-5238.
This chapter is intended as a brief introduction
to carbon-cycle modeling and field measure- 
ments at the Spruce and Peatland Response Under 
Changing Environments (SPRUCE) experiment 
in a forested wetland in northern Minnesota. The 
goal is to familiarize the reader with the study in 
preparation for subsequent training and example 
applications of data assimilation into models using 
SPRUCE data. 


INTRODUCTION 


SPRUCE is a large-scale, decade-long experiment 
designed to assess the response of a northern 
peatland bog ecosystem, which contains a large 
amount of carbon belowground, to changes in 
atmospheric temperature and carbon dioxide 
(CO₂) concentrations that approximate possible
conditions in the latter half of the 21st century. 
Peatlands have been identified as vulnerable eco- 
systems that potentially have large feedbacks to the 
global carbon cycle, and as a result may affect cli- 
mate change. Although peatlands comprise a rela- 
tively small fraction, about 3%, of global land area, 
they are estimated to contain at least one-third of 
all carbon on the land surface. Therefore, under- 
standing the response of these systems to warm- 
ing, changing moisture conditions, and rising 
atmospheric CO₂ is critical to our ability to predict
future climate using coupled Earth system models. 

One key uncertainty is how carbon may leave 
the system to the atmosphere, as either CO₂ or
as methane gas (CH₄). This is critically important
because CH₄ is a much more potent greenhouse
gas than CO₂, with more than 25 times greater
warming potential over a 100-year timeframe.
The saturated peat, the biogeochemical environment
and the types of vegetation in peatlands are
conducive to anaerobic decomposition (breakdown
of organic matter in the absence of oxygen) and
therefore to high levels of CH₄ emissions compared
to other ecosystems. While CO₂ fluxes from aerobic
decomposition (when oxygen is available) are
sensitive to temperature, CH₄ production and
emissions may be even more sensitive to warming than
CO₂. On the other hand, drying associated with
warming conditions may reduce methane production
and increase methanotrophy (consumption of
CH₄ by microbes in the peat). Observations of CO₂
and CH₄ fluxes from SPRUCE are informing model
parameter values and structural representations of 
key processes in mechanistic models to improve 
predictive understanding of the system. 

SPRUCE researchers are also very interested 
in how the experimental treatments impact the 
different types of peatland vegetation in terms 
of physiology, phenology (timing and length of
growing season), reproduction and mortality. The 
most extreme level of warming in the experiment 
is consistent with climate model projections at the 
end of the century under the strongest greenhouse 
gas emissions scenarios. It effectively shifts the cli- 
mate of SPRUCE to that currently experienced in 
central Missouri, which is hundreds of kilometers 
to the south. 


SITE DESCRIPTION 


The SPRUCE experiment is located in a forested 
wetland called the S1 bog, which is part of the 
United States Forest Service (USFS) Experimental 
Forest in North Central Minnesota. The S1 sys- 
tem is a raised-dome ombrotrophic bog, mean- 
ing that it is rain-fed, nutrient-poor and has a 
perched water table that is disconnected from the 
influence of regional groundwater. The bog has a 
mean annual air temperature of 3.3°C and mean
annual precipitation of 768 mm. The bog experi- 
ences cold winters: snow cover generally persists 
from late autumn until early spring, and ice layers 
form in the peat that may persist until May or June. 


Within the bog, there is microtopography consist- 
ing of raised areas (hummocks) and sunken areas 
(hollows). Hummock areas are generally 15-20 
cm higher than hollows and are almost never inun- 
dated. In typical years, the water table generally 
ranges from as low as 20-30 cm below the hol- 
low surface in dry conditions to 10-15 cm above 
the hollow surface in wet conditions. At the begin- 
ning of the study, both hummock and hollow sur- 
faces of the bog were nearly completely covered 
with Sphagnum mosses. Above that, there is a woody 
shrub layer dominated by two species: Rhododendron 
groenlandicum (Labrador tea) and Chamaedaphne calycu- 
lata (leatherleaf). There are two main types of trees 
on the S1 bog, Picea mariana (black spruce) and 
Larix laricina (larch). Existing trees were cleared
in strip cuts in 1974 for a different experiment, 
and new trees have been regrowing since then. 
This area with relatively young, short trees within 
the strip cut areas is ideal for the SPRUCE experi- 
ment because the enclosures do not have to be as 
large or as costly to expose these trees to whole- 
ecosystem warming. 

SPRUCE uses an open-top enclosure design 
(Figure 25.1) in which the whole-ecosystem 


Figure 25.1. Top-down view of a SPRUCE enclosure. 
warming and elevated CO₂ treatments are conducted
(Hanson et al., 2017a and 2017b). Air
warming is accomplished using propane-fired 
furnaces in combination with blowers distributed 
around the enclosure. Peat warming is accom- 
plished using resistance heaters that heat depths 
between 2 and 3 meters below the surface. A cor- 
ral system isolates the hydrology within the enclo- 
sures, so that the water table within may be lower 
or higher than the surrounding bog. 

SPRUCE has a total of ten enclosures with 
five different levels of warming ranging from 
no added heat (+0°C) to +9°C in 2.25°C increments.
Half of the enclosures have ambient CO₂
concentrations (about 400 parts per million) at
the five warming levels, while the other half have
an elevated CO₂ concentration target of +500
parts per million (ppm) over ambient, typically
ranging between 800 and 900 ppm, at the
same warming levels. Elevated CO₂ is supplied
only during daytime and in the growing season 
when photosynthesis is occurring. No water vapor 
is added, causing reduced relative humidity and 
increased vapor pressure deficit in the warmed 
enclosures. Extensive measurements within the 
enclosures include meteorological conditions, 
peat temperature and water table depth, CO₂ and
CH₄ fluxes using a large-collar chamber measurement
(Hanson et al., 2016), species-level vegetation
biomass and productivity, phenology cameras
and porewater chemistry. In addition to the active 
warming treatments, there are passive enclosure 
effects including increased longwave radiation 
input, decreased shortwave radiation and changes 
in air flow, resulting in additional temperature 
increases between 1°C and 2°C at all warming lev- 
els compared to outside the enclosures. Two addi- 
tional plots without enclosures are also measured 
to record the evolution of the ecosystem under 
ambient conditions. The study’s regression-based 
design allows for the determination of response 
curves over a range of conditions and facilitates 
comparison with model outputs. 
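
As a small illustration of what a regression-based design offers, the sketch below lays out the five warming levels described above and fits a linear response curve across the ten enclosures. The response values are synthetic numbers generated for the example, not SPRUCE measurements, and the linear form is an assumption; any response curve could be fitted in the same way.

```python
import numpy as np

# five warming levels from +0 to +9 °C in 2.25 °C increments
warming = np.linspace(0.0, 9.0, 5)
# two enclosures per level (one ambient CO2, one elevated CO2)
treatment = np.repeat(warming, 2)

# hypothetical response variable: synthetic values with a known slope plus noise
rng = np.random.default_rng(3)
response = 10.0 + 1.5 * treatment + rng.normal(0.0, 1.0, treatment.size)

# fit a straight-line response curve across all enclosures
slope, intercept = np.polyfit(treatment, response, 1)
print("warming levels:", warming)
print("estimated change in response per °C:", slope)
```

Because every enclosure sits at a known point along the warming gradient, all ten contribute to the fitted curve, and the same curve can be computed from model output for a direct comparison.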

Lengthy preparation was required for the large- 
scale experiment. Pretreatment characterization at 
the S1 bog began in 2009, involving extensive peat 
and vegetation measurements. Construction for the 
treatment experiments began in 2012, beginning 
with roadwork and other infrastructure develop- 
ment. The corral systems were then built, followed 
by the construction of the enclosures. Whole- 
ecosystem warming began in August 2015, and 


elevated CO₂ treatments began in June 2016. The
experiment is anticipated to run through 2025. A
number of key science questions were posed at the 
beginning of the experiment: 


• Are peatland ecosystems and organisms
vulnerable to atmospheric and climatic change?


• At what rate will ancient carbon be released
from accumulated peat in response to deep
belowground warming, and what is the
relative release of CO₂ compared with CH₄?


• What are the interactions between ecosystem
responses to warming and the availability
of nutrients and water?


• How does elevated CO₂ modify ecosystem
responses to warming and the availability of
nutrients and water?


MODELING FOR THE SPRUCE EXPERIMENT 


Although the scale of SPRUCE is unprecedented for 
a peatland ecosystem manipulation experiment, it 
is difficult to know if the answers to these ques- 
tions at the study site will be consistent across 
other peatland systems. The project investigators 
therefore envision mechanistic modeling as a key 
method to extrapolate the results of SPRUCE to 
other systems. This requires a multi-scale model- 
ing framework that can be applied from site to 
global scales. 

The United States Department of Energy (DOE), 
which provides the primary funding for SPRUCE, 
has heavily invested in such a modeling framework 
which we are using. In 2014, the DOE initiated 
development of a new Earth system model, the 
Energy Exascale Earth System Model (E3SM). This 
model branched from the well-known Community 
Earth System Model (CESM). The land component 
of E3SM, known as ELM, is a land-surface model 
that includes cycling of water, carbon, nitrogen 
and phosphorus. ELM has been used extensively in 
coupled E3SM (Burrows et al., 2020), uncoupled 
(land-only simulations driven by observed atmo- 
spheric forcings), and at the site-level including 
eddy covariance and ecosystem manipulation sites 
like SPRUCE. However, the default version of ELM 
lacks key processes necessary to simulate peat- 
land carbon, water and nutrient dynamics accu- 
rately. A SPRUCE-specific version, ELM-SPRUCE, 
was recently developed (Shi et al., 2015, 2021). It 
incorporates wetland hydrology, microtopography, 
and plant functional types that are not currently 
represented in ELM including Sphagnum mosses.
ELM-SPRUCE also includes a more mechanistic
CH₄ model that explicitly represents microbial
populations involved in CH₄ production and
consumption. ELM-SPRUCE can help to answer 
the above research questions, test other hypoth- 
eses about the impact of environmental change at 
SPRUCE, and eventually simulate broader peatland 
regions on a global scale. 

In addition to ELM-SPRUCE, other model- 
ing groups are also simulating the experimental 
treatments at the site. The Terrestrial ECOsystem 
(TECO) model was introduced in Chapter 2. It 
has well-developed methods for model-data inte- 
gration. A version of the TECO model has been 
developed for application at SPRUCE (Ma et al., 
2017). This model, TECO-SPRUCE, is being used 
to make projections of CO₂ and CH₄ fluxes under
the experimental treatments and to estimate the 
magnitudes of prediction uncertainties that result 
from uncertainties in forcings and model param- 
eters (see Chapter 26). 

In general, having projections from multiple 
models is an important way to understand the 
impact of structural uncertainty, which reflects the 
different ways in which different model frame- 
works may represent processes. Some models may 
be more complex than others by including more 
processes, or they may use different equations or 
algorithms to represent a specific process. A model 
intercomparison study focused on SPRUCE is cur- 
rently under development, and SPRUCE data are 
being made available to interested modeling groups. 


MODEL VALIDATION AND UNCERTAINTY
QUANTIFICATION


While having projections from multiple models 
is useful, quantitatively estimating within-model 
uncertainty is key to knowing the confidence in 
our predictions, and for understanding what 
measurements may be most useful in constrain- 
ing these predictions further. Model uncertainty 
quantification (UQ) is a key goal of the SPRUCE 
modeling work, and it has been used in both 
the ELM-SPRUCE and TECO-SPRUCE models. An 
important part of UQ is estimating how uncertain 
model parameters contribute to uncertainty in 
predictions such as CO₂ fluxes or stocks. A model
such as ELM-SPRUCE is very complex and con- 
tains well over 100 uncertain model parameters. 
An example of an uncertain parameter would be 


the sensitivity of heterotrophic respiration to
temperature (Q₁₀), for which published values in the
literature may range from 50% lower to 50% higher
than the assumed model default value. This parameter
uncertainty is likely to cause large uncertainties
in the predicted response of net CO₂ fluxes
to experimental warming. Ultimately, we would
like to calibrate such parameters and reduce their
uncertainty using measurements; for example,
the within-enclosure CO₂ flux information could
be used to constrain the Q₁₀ value at SPRUCE.
However, as the number of uncertain parameters 
grows, the computational expense of model cali- 
bration rises exponentially as it requires a larger 
and larger number of model simulations (also 
referred to as ensemble members) to sample the 
potential parameter space. Therefore, we usually 
need to identify the most important parameters 
first using sensitivity analysis. 
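
To make the Q₁₀ example concrete, the sketch below uses the widely used Q₁₀ formulation, R(T) = R_ref × Q₁₀^((T − T_ref)/10), and shows how a default value varied by 50% in either direction changes the simulated respiration response to a 9°C warming. The default Q₁₀ of 2.0, the reference temperature and the function name are illustrative assumptions, not ELM-SPRUCE settings.

```python
def q10_respiration(temp_c, q10, r_ref=1.0, t_ref=10.0):
    """Respiration at temp_c relative to the rate r_ref at the reference temperature."""
    return r_ref * q10 ** ((temp_c - t_ref) / 10.0)


default_q10 = 2.0  # assumed default, for illustration only
ratios = {}
for q10 in (0.5 * default_q10, default_q10, 1.5 * default_q10):
    # factor by which respiration changes when temperature rises from 10 to 19 °C
    ratios[q10] = q10_respiration(19.0, q10) / q10_respiration(10.0, q10)
    print(f"Q10 = {q10:.1f}: respiration changes by a factor of {ratios[q10]:.2f}")
```

The spread between the three factors is the kind of prediction uncertainty that calibration against within-enclosure flux data is meant to reduce.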

Sensitivity analyses are usually conducted for
a set of model outputs, or quantities of interest
(QoIs). For example, a QoI might be the
site-averaged net ecosystem exchange over a 10-year
period, or the average date of leaf-out in spring.
The contribution of each parameter to the
variance of a QoI may be estimated in a sensitivity
analysis using a model ensemble in which
multiple parameters are varied randomly. Fortunately, a
smaller number of ensemble members is necessary
for sensitivity analysis than for calibration. The
objective of the sensitivity analysis is to reduce the
number of calibration parameters to a reasonable
number, usually around 20 or fewer. In the
ELM-SPRUCE model, we begin the sensitivity analysis
by identifying minimum and maximum possible
values for each parameter. This can be done
by surveying the literature, trait databases such as
TRY and the Fine Root Ecology Database (FRED),
or by making educated guesses about parameter
uncertainty (e.g., ±25% from default values).
An ensemble of model parameter values is then
created by randomly sampling parameter values in
these ranges. This model ensemble is used to create
a surrogate model, or model response surface, for
each QoI. Many different approaches can be used
to create surrogate models, including machine
learning; here, we use polynomial functions.
Using this approach, we found that about 2500
ensemble members can yield trustworthy
sensitivities for about 65 uncertain parameters in ELM
(Ricciuto et al., 2018). While running this number
of simulations is computationally feasible in ELM
on a mid-size computing cluster, other models
may be more or less computationally expensive,
allowing for different ensemble sizes.
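
The workflow in the paragraph above (sample parameters within their ranges, run the model, fit a polynomial surrogate, and attribute variance to each parameter) can be sketched on a toy problem. The three-parameter toy_model, the additive quadratic surrogate without interaction terms, and the normalization by the summed main-effect variances are simplifying assumptions chosen for brevity; they are not the surrogate construction used by Ricciuto et al. (2018).

```python
import numpy as np


def toy_model(p):
    """Hypothetical quantity of interest; p has shape (n_ens, 3)."""
    return 3.0 * p[:, 0] + p[:, 1] ** 2 + 0.1 * p[:, 2]


def surrogate_sensitivity(p_min, p_max, n_ens=2500, seed=0):
    rng = np.random.default_rng(seed)
    p_min, p_max = np.asarray(p_min, float), np.asarray(p_max, float)
    n_par = p_min.size
    # ensemble: random parameter values within the prescribed ranges
    p = rng.uniform(p_min, p_max, size=(n_ens, n_par))
    y = toy_model(p)
    # additive quadratic surrogate: intercept, linear and squared terms
    X = np.hstack([np.ones((n_ens, 1)), p, p ** 2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    lin, quad = beta[1:1 + n_par], beta[1 + n_par:]
    # contribution of each parameter to the surrogate output
    main = lin * p + quad * p ** 2
    var_k = main.var(axis=0)
    return var_k / var_k.sum()   # share of the additive variance per parameter


S = surrogate_sensitivity([-1.0, -1.0, -1.0], [1.0, 1.0, 1.0])
print("sensitivity shares:", S)
```

Parameters with a negligible share would be fixed at their default values, leaving only the influential ones for calibration.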

Model calibration involves finding a set or sets 
of optimal parameters that best fit observations 
by minimizing a cost function (Chapter 21). A 
cost function typically yields a single value that 
can integrate information from multiple types 
of observations (e.g., both CO₂ fluxes and biomass
measurements) and from observations at
multiple times. Observations may be weighted 
by their uncertainties or using other methods. 
Model calibration may also involve the estimation 
of parameter and prediction uncertainties. It may 
be accomplished using a variety of techniques. A 
preferred technique is Markov Chain Monte Carlo 
(MCMC), introduced in Chapter 22. MCMC has 
the desirable quality that it can calculate the full 
parameter posterior probability density functions 
(PPDFs) without making any prior assumptions 
about the functional forms of the distributions. 
These PPDFs may also be used to estimate prediction
uncertainty for QoIs. However, MCMC is a relatively
expensive method that requires a large number of 
model evaluations (usually at least 10,000) and is 
not easily parallelized. In models that are fast to 
evaluate such as TECO-SPRUCE, MCMC may be 
used directly. However, in expensive models such 
as ELM-SPRUCE, it is first necessary to construct a 
surrogate model. An example of surrogate model 
calibration in ELM is given by Lu et al. (2018a). 
The methods are similar to those used to build the
surrogates for sensitivity analysis. However, the
surrogate models introduce error into the calibration
and therefore must be considerably more accurate 
to ensure a trustworthy calibration result. 
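
A cost function that combines several observation types into a single value, with each type weighted by its uncertainty, might look like the following sketch. The dictionary layout, the variable names and all numbers are hypothetical; the form of the mismatch (squared residuals divided by twice the variance) follows the cost function used in the earlier data assimilation practice.

```python
import numpy as np


def cost_function(simulated, observed, sigma):
    """Sum over data streams of squared mismatch scaled by 2 * sigma**2."""
    total = 0.0
    for name in observed:
        resid = np.asarray(simulated[name]) - np.asarray(observed[name])
        total += float(np.sum(resid ** 2) / (2.0 * sigma[name] ** 2))
    return total


# two data streams with very different units and uncertainties (synthetic values)
observed = {"co2_flux": np.array([1.0, 1.2, 0.9]), "biomass": np.array([250.0])}
sigma = {"co2_flux": 0.1, "biomass": 25.0}

good = {"co2_flux": np.array([1.0, 1.1, 1.0]), "biomass": np.array([260.0])}
poor = {"co2_flux": np.array([1.5, 1.7, 1.4]), "biomass": np.array([260.0])}

J_good = cost_function(good, observed, sigma)
J_poor = cost_function(poor, observed, sigma)
print("cost of the closer simulation:", J_good)
print("cost of the poorer simulation:", J_poor)
```

Dividing by the variance puts fluxes and biomass on a common, unitless scale, so that a data stream with many precise observations does not automatically dominate one with few, unless its stated uncertainty justifies it.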

Using ELM-SPRUCE, model calibration was 
performed with pre-treatment observations of 
vegetation biomass and productivity for four dif- 
ferent plant functional types (Shi et al., 2021). 
The calibrated model parameters differed substan- 
tially from the default ELM parameters used in the 
global parameterization for boreal plant functional 
types. The calibrated model was then used to pre- 
dict treatment responses at the site for the most 
extreme warming scenario of +9°C. The model 
predicted that black spruce trees would steadily 
decline in biomass and productivity with warming, 
while the shrubs and larch would increase slightly. 
The Sphagnum productivity was simulated to decline 
during dry periods and increase during wet peri- 
ods. We can now begin to validate the model using 


treatment observations. There is some indication 
that the black spruce trees are actually responding 
negatively to the warming treatments, especially in 
comparison to the shrubs and larch trees. However, 
a recent study showed that Sphagnum productivity 
and biomass are rapidly declining, which was not 
predicted by ELM-SPRUCE. This may be in part due 
to a lack of certain processes in ELM-SPRUCE. For 
example, more productive shrubs may shade out 
the moss layer, which cannot be represented cur- 
rently in ELM-SPRUCE because there is no com- 
petition for light in that model. It is also possible 
that the model predictions will improve when we 
begin to use the treatment data for model calibra- 
tion. We did find that ELM-SPRUCE predicts the 
change in net carbon flux with temperature quite 
accurately, and that both model and observations 
indicate that warming causes a significant source 
of CO₂ and CH₄ to the atmosphere (Hanson et al.,
2020). If one naively assumes that all peatland sys- 
tems respond similarly over the entire globe, this 
would result in a large source of greenhouse gases 
and a positive feedback which would be large 
enough to further strengthen global warming. We 
will test this assumption in the model in the future 
by running global simulations. Interestingly, how- 
ever, the observations do not yet indicate a strong 
effect of elevated CO₂ concentrations on vegetation
biomass at SPRUCE. ELM-SPRUCE and
TECO-SPRUCE before data assimilation both predict a
strong fertilization effect. Reconciling this with 
observations will take additional empirical and 
model development work for ELM-SPRUCE and 
this will be informed by data assimilation using 
TECO-SPRUCE. 

The experimental treatments at SPRUCE com- 
bined with the model-data integration framework 
in ELM-SPRUCE, TECO-SPRUCE and other models 
provide a useful testbed for improving predic- 
tive understanding of peatland ecosystems. The 
interaction of the site hydrology, biogeochemis- 
try, and multiple vegetation types under varying 
treatments makes obtaining accurate predictions 
particularly challenging compared to other study 
sites, for example eddy covariance tower foot- 
prints which often have relatively homogeneous 
vegetation coverage. Recently it has been observed 
that responses of the different vegetation types 
to warming are not uniform; while shrubs are 
becoming more productive and growing more 
fine roots (Malhotra et al., 2020), the Sphagnum 
mosses are dying and sharply declining in areal 
coverage under strong warming (Norby et al.,
2019). The growing season generally becomes 
longer with warming, but some vegetation types 
have a stronger phenology response than others 
(Richardson et al., 2018). Ideally, models of the 
SPRUCE system must be simple enough to ensure 
simulations are relatively inexpensive so that we 
can run parameter ensembles to explore predic- 
tion uncertainty. On the other hand, models must 
contain enough process realism to capture the 
divergent responses of different vegetation types 
and to predict the strong increases in CO₂ and CH₄
surface fluxes to the atmosphere. Much research 
remains to be done to predict how the SPRUCE S1 
bog and other peatlands will respond to rapidly 
changing environmental conditions. 
QUIZZES


1. Why are peatlands important for land-atmosphere
feedbacks?


2. What is the gradient experimental design for 
this SPRUCE project? 


3. What are the benefits of incorporating modeling 
approaches in such a large experimental study? 


4. When modeling results are different from obser- 
vations, what should we do? 
Data assimilation is widely used in terrestrial
ecosystem studies. This chapter illustrates the use of
data assimilation with a site-level study to project
peatland methane (CH₄) emission in response to
warming. Wetland CH₄ emissions comprise one
third of the global CH₄ source and remain the
largest source of uncertainty in the global budget.
Wetland CH₄ emissions estimated by process-based
(bottom-up) models are used as the prior
information for atmospheric inversion (top-down)
estimates. It is thus important to constrain
process-based models with in situ observations to
improve both the bottom-up and top-down estimates.
We give a brief background of methane
modeling and then show the application of data
assimilation in the methane model in seven steps.


UNCERTAINTY IN METHANE MODELING 


Methane (CH₄) has 25 times the global warming
potential of CO₂ over a 100-year time scale
(Myhre et al., 2013). It is directly responsible for
approximately 20% of global warming since
pre-industrial time (Forster et al., 2007). Wetlands are
an important natural source of CH₄ emissions to
the atmosphere, but constitute a principal source
of uncertainty in the global atmospheric CH₄ budget
(Saunois et al., 2020). Global wetland CH₄
emissions estimated by process-based models
disagree substantially with atmospheric
inversion model ensembles (Saunois et al., 2020).

There are three major sources of uncertainty
in model-estimated CH₄ emission. The first is
uncertainty in the mechanisms that control
biogeochemical processes, owing to the difficulty of
acquiring empirical data. For example, it is very
difficult to measure the aerobic and anaerobic
oxidation of methane, and the effect of redox
potential on methane oxidation awaits more empirical
data before it can be represented in models. The
second source of uncertainty is the wide range of
possible parameter values for methane-related
processes. Flux-based measurements of Q₁₀
(temperature sensitivity) of CH₄ release from
different warming plots at a single site range from
2.12 to 32.16 (Gill et al., 2017). Manually tuning
parameter values to match the observed CH₄ fluxes
could achieve the right answer with diverse
combinations of parameter values, the problem of
equifinality. The third source is poorly mapped
wetland extent and seasonal inundation. Current
wetland maps are mainly based on inventory data and
satellite observations. Inventory maps are limited by
their low spatial and temporal coverage, while
satellite-based maps cannot capture wetland area
under dense vegetation cover.

It is critical to understand how wetland CH₄
emissions may respond to climate change, given the
much larger warming potential of CH₄ compared
to CO₂ over a 100-year time scale (Myhre et al., 2013).
Terrestrial biosphere models that include methane
processes explicitly describe the CH₄ flux exchange
through plant-mediated transport, diffusion, and
bubbling (ebullition). These are the three major
pathways of wetland CH₄ emission. The relative
contributions of these three pathways to methane
emissions under climate warming have not been
unraveled using either experiments or modeling
approaches. In most process-based methane models,
these CH₄ emission pathways are calculated
based on the CH₄ concentration in each peat layer,
which is primarily controlled by CH₄ production.
If some of the parameters in CH₄ production,
plant-mediated transport, ebullition, and diffusion
can be constrained by observational data, we
may be able to improve model predictions by both
improving accuracy and reducing uncertainty.


ASSIMILATION OF METHANE EMISSIONS DATA 
INTO THE TECO MODEL 


The seven steps of data assimilation were intro- 
duced in Chapter 21. As an illustrative example, we 
will apply these steps to the assimilation of meth- 
ane emissions data from in situ measurement into 
the TECO model. 


Step 1: Define the Objective. 


Our objective is to reduce the uncertainty of 
model estimations of how methane emis- 
sions vary in response to warming. Taking the 
example of a peatland methane study (Ma 
et al., 2017), data assimilation is used first to 
constrain parameters with observational data, 
thereby reducing uncertainty in methane 
prediction. 


Step 2: Prepare Data. 


Our data come from the Spruce and Peatland
Responses Under Changing Environments
(SPRUCE) experiment at a northern peatland
site in Minnesota, USA. The SPRUCE project
uses experimental warming and elevated CO₂
treatments to assess the responses of northern
peatland ecosystems to future environmental
conditions (Hanson et al., 2016). Here we
will use in situ net CH₄ emission data from
ambient plots to constrain parameters of a
process-based methane model. CH₄ emission
measurements were acquired during the snow-free
months using a portable open-path analyzer
in a static chamber (1.13 m² area), near
monthly from 2011 to 2016 (Hanson et al.,
2017a, 2017b). The 2011-2014 data are
used for data assimilation and the 2015-2016
data are used for validation. In total, 45 daily CH₄
chamber measurement data points were integrated
from ambient plots from 2011 to 2016.
A mean and a standard deviation are calculated
from all the measurements on the same day in
each ambient plot.
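
The aggregation of same-day chamber readings into a mean and a standard deviation can be sketched as follows. This is a minimal illustration, not the SPRUCE processing code: the function name `daily_stats`, the record layout, and the flux values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def daily_stats(records):
    """Group (date, flux) pairs by date and return {date: (mean, sd)}.

    sd is the sample standard deviation. A day with a single reading
    gets sd = None because no spread can be estimated from one value.
    """
    by_day = defaultdict(list)
    for date, flux in records:
        by_day[date].append(flux)

    stats = {}
    for date, fluxes in sorted(by_day.items()):
        sd = stdev(fluxes) if len(fluxes) > 1 else None
        stats[date] = (mean(fluxes), sd)
    return stats


# Hypothetical chamber readings: (date, CH4 flux in arbitrary units)
records = [
    ("2012-07-15", 2.0), ("2012-07-15", 4.0), ("2012-07-15", 6.0),
    ("2012-08-14", 3.0),
]
obs = daily_stats(records)
```

Days with only one reading need a separately assumed uncertainty before they can be weighted in a cost function; how that is handled is a choice the sketch leaves open.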


Step 3: Choose a Model. 


Our example uses a methane-enabled version 
of the Terrestrial ECOsystem (TECO) model, 
which incorporates a ten-layer vertical mixing 
CH₄ module (Ma et al., 2017). The TECO model
has been calibrated to the SPRUCE site to study 
the carbon cycle and soil thermal dynamics 
(Jiang et al., 2018). A detailed description of 
TECO can be found in Weng and Luo (2008). 


TECO-CH₄ explicitly considers the transient
and vertical dynamics of CH₄ production,
CH₄ oxidation, and CH₄ transport from
belowground to the atmosphere. The structure
and processes are adapted from a number of
previous studies and models such as CLM4.5
(Riley et al., 2011), LPJ-WHyMe (Wania, Ross,
and Prentice, 2010), Walter's model (Walter
& Heimann, 2000), and TEM (Zhuang et al.,
2004). All the above models assume that soil
can be separated into aerobic and anaerobic
layers divided by the water table. These models
also assume that the majority of CH₄ oxidation
occurs in the aerobic layers and rhizosphere,
and that most methane production occurs in
the anaerobic layers. Within each soil layer,
CH₄ concentration is calculated as the mass
balance of CH₄ production (gain), CH₄ oxidation
(loss), CH₄ diffusion from adjacent layers, CH₄
ebullition (loss), and plant-mediated transport
(loss). The flow diagram of TECO-CH₄ is shown
in Figure 26.1 and further described below.

CH₄ production is determined by carbon
availability, represented by heterotrophic
respiration, and by soil environmental conditions such
as water table height and soil temperature. As
in most methane models, CH₄ production only
occurs when soil temperature is above 0°C and
below 45°C. Given that CH₄ oxidation is largely
controlled by CH₄ concentration, it is assumed
to follow Michaelis-Menten kinetics.
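
The temperature window for production and the Michaelis-Menten form for oxidation can be illustrated with a short sketch. The function names are hypothetical, the default values are the oxidation entries listed in Table 26.1, and the temperature modifier on oxidation is deliberately left out, so this is not the TECO-CH₄ code.

```python
def production_allowed(soil_temp_c, t_min=0.0, t_max=45.0):
    """CH4 production is active only strictly between t_min and t_max (deg C)."""
    return t_min < soil_temp_c < t_max


def oxidation_rate(ch4_conc, o_max=15.0, k_ch4=5.0):
    """Michaelis-Menten oxidation rate (umol L-1 h-1).

    ch4_conc : CH4 concentration in the layer (umol L-1)
    o_max    : maximum oxidation rate (Table 26.1 value, 15.0)
    k_ch4    : Michaelis-Menten coefficient (Table 26.1 value, 5.0)

    The rate rises almost linearly at low concentration and
    saturates towards o_max at high concentration.
    """
    return o_max * ch4_conc / (k_ch4 + ch4_conc)


# At a concentration equal to k_ch4 the rate is half of o_max.
half_saturation_rate = oxidation_rate(5.0)
```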

The CH₄ diffusion across soil layers follows
Fick's law, which relates the diffusive
flux to the gradient of the concentration, and
Henry's law, which resolves the diffusive flux
at the liquid-atmosphere boundary. The net
exchange between the surface soil layer and
the atmosphere is counted as the diffusive part
of CH₄ emission (or uptake). The methane flux
at the bottom boundary is set to zero, and the
CH₄ concentration at the soil surface
(or water surface if the water table is at or
above the soil surface) is set to the standard
atmospheric CH₄ concentration.

Figure 26.1. Flow diagram of the CH₄ module and its linkage to the soil C model in TECO-CH₄. Reproduced from Ma et al. (2017).
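
The Fick's-law part of the diffusion scheme described above can be sketched in a few lines. This is a simplified illustration: the function name, the uniform layer spacing `dz`, and the single diffusivity are assumptions, and the Henry's-law treatment of the liquid-atmosphere boundary is omitted.

```python
def interface_fluxes(conc, dz, diffusivity, surface_conc):
    """Upward diffusive flux at the top interface of each layer.

    conc         : CH4 concentrations, surface layer first, deepest last
    dz           : distance between layer centres (assumed uniform)
    diffusivity  : diffusion coefficient (assumed the same in all layers)
    surface_conc : fixed concentration at the soil or water surface

    Fick's first law gives flux = D * (C_below - C_above) / dz, positive
    upward. The first value returned is the exchange with the atmosphere
    (emission if positive, uptake if negative). No flux is computed below
    the deepest layer, which is the zero-flux bottom boundary.
    """
    fluxes = []
    above = surface_conc
    for below in conc:
        fluxes.append(diffusivity * (below - above) / dz)
        above = below
    return fluxes


# Hypothetical two-layer profile with concentration increasing with depth.
fluxes = interface_fluxes([2.0, 5.0], dz=10.0, diffusivity=0.2, surface_conc=1.0)
```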

Air-filled aerenchyma tissues of plants act as
a chimney to quickly emit CH₄ from the
rhizosphere directly into the atmosphere. A portion
of CH₄ is oxidized within the plant tissue during
the transport. TECO-CH₄ uses a parameter
(T_veg) to represent the ability to transport CH₄
through tissues at a plant community level. The
growth of plants also affects the amount of gas
transported, through the influence of Leaf Area
Index (LAI). Ebullition entails the formation of
bubbles when the CH₄ concentration exceeds
a certain threshold. If the water table is above
the soil surface, the bubbles are emitted directly
into the atmosphere, bypassing the aerobic zones
where CH₄ is consumed. If the water table is
below the soil surface, the bubbles are instead
added to the soil layer just above the water table
and then diffuse through the upper layers.

Once a model is chosen, the next step is to
choose parameters for optimization. The
performance of data assimilation is affected by the
variety and amount of observational data as well
as the parameters that are targeted for optimization.
One common way to choose parameters
is through a sensitivity test. In this study, we
choose nine key parameters used in TECO-CH₄
for an initial sensitivity test (Table 26.1). Four
of these parameters are revealed to be particularly
important for the modelled variability in
CH₄ emission, i.e., the emissions are sensitive
to those parameters. We thus pick these four
parameters for data assimilation. The prior ranges
of these parameters are estimated from published
experimental measurements or empirical
values used in CH₄ biogeochemistry models
(Table 26.1).
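
One simple form of sensitivity test is the one-at-a-time approach sketched below. The chapter does not say which test was used, so this is only an illustration: `toy_emission` is a stand-in for a model run and does not contain the TECO-CH₄ equations, and the default and range for `k_ch4` and the default for `q10_pro` are hypothetical.

```python
def toy_emission(params):
    """Stand-in for a model run; NOT the TECO-CH4 equations.

    The output depends on r_me and q10_pro but, by construction,
    not on k_ch4, so the test should flag only the first two.
    """
    return params["r_me"] * params["q10_pro"] ** 0.5 + 0.0 * params["k_ch4"]


def oat_sensitivity(model, defaults, ranges, steps=5):
    """One-at-a-time test: sweep each parameter across its range while
    holding the others at their defaults, and report the output spread."""
    spread = {}
    for name, (lo, hi) in ranges.items():
        outputs = []
        for i in range(steps):
            trial = dict(defaults)
            trial[name] = lo + (hi - lo) * i / (steps - 1)
            outputs.append(model(trial))
        spread[name] = max(outputs) - min(outputs)
    return spread


defaults = {"r_me": 0.65, "q10_pro": 2.0, "k_ch4": 5.0}   # q10_pro, k_ch4: assumed
ranges = {"r_me": (0.0, 0.7), "q10_pro": (0.0, 10.0), "k_ch4": (1.0, 10.0)}
spread = oat_sensitivity(toy_emission, defaults, ranges)
sensitive = [name for name, s in spread.items() if s > 0.0]
```

A one-at-a-time sweep ignores interactions between parameters; variance-based methods are an alternative when interactions are expected to matter.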


Step 4: Cost Function. 


As the data assimilation algorithm for this
study, we will choose the adaptive Metropolis-Hastings
Markov chain Monte Carlo (MCMC) method.
The approach was introduced in Chapter 22.
Figure 26.2 shows the logic flow of data assimilation
written into the source code of the TECO
data assimilation framework. At each step in the
chain, a new set of parameter values is chosen
at random, the model generates results with
these parameter values, and the disagreement
with the observations is quantified using the cost
function:


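$$J = \sum_{i=1}^{n} \sum_{t \in \mathrm{obs}(Z_i)} \frac{\left[Z_i(t) - X_i(t)\right]^2}{2\sigma_i^2(t)} \qquad (26.1)$$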


The cost function aggregates a total model-data
mismatch value (J) from all the 30 time
points (since we have 30 near-monthly net CH₄
emission observations from 2011 to 2014). In
this study n equals 1 as we have only one set
of observation data (net CH₄ emission rate).
We save modelled CH₄ emission as X(t) when
the corresponding observed CH₄ emission (Z(t))
is available at time t. The standard deviation
(σ(t)) expresses the confidence level of
the observation. In this study, it reflects errors
from instruments, measurement, and spatial-temporal
heterogeneity. It is used in the cost
function to adjust the weight of the model-data
mismatch for individual data points. In data
assimilation, both the mean and the standard deviation
of observations are very important and should
always be carefully considered in practice.
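
A cost function of this kind can be written in a few lines. The sketch below assumes the common Gaussian form, in which each squared mismatch is divided by twice the observation variance; the factor of two and the function name `cost` are assumptions for illustration.

```python
def cost(modelled, observed, sigma):
    """Model-data mismatch J for a single data stream (n = 1).

    modelled : X(t), model output at the observation times
    observed : Z(t), observed means
    sigma    : sigma(t), observation standard deviations

    Assumed form: J = sum_t [Z(t) - X(t)]^2 / (2 * sigma(t)^2).
    A large sigma(t) down-weights that data point.
    """
    if not (len(modelled) == len(observed) == len(sigma)):
        raise ValueError("modelled, observed and sigma must have equal length")
    return sum(
        (z - x) ** 2 / (2.0 * s ** 2)
        for x, z, s in zip(modelled, observed, sigma)
    )


# The second point has twice the mismatch but also twice the sigma,
# so both points contribute equally to J.
j_example = cost(modelled=[1.0, 2.0], observed=[2.0, 4.0], sigma=[1.0, 2.0])
```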


Step 5: Optimization. 
If J_new (Step 4) passes the acceptance criteria,
the proposed parameter values are saved.
Acceptance depends on the value of the cost
function relative to that (J_old) from the previous
iteration of the MCMC. An initial value for J_old is
set at 9,000,000. This is just a large initial value,
to ensure that the first proposed value is accepted
and the iteration begins. At each step, if J_new <
J_old, the proposed parameter values are accepted,
and J_old is updated with J_new and used in the
next round. After a warm-up period, parameter
values start to converge (accepted parameter
values during the warm-up period are discarded from
the posterior distribution). Figure 26.3 shows
the trajectory of the updated model-data mismatch
(J_new) on the MCMC chain. As the MCMC seeks the
global minimum mismatch, the acceptance
TABLE 26.1
Major parameters in CH₄ production, oxidation, diffusion, ebullition, and plant-mediated transportation in
TECO-CH₄. Bold signifies parameters used for initial sensitivity test. Parameters with a range indicate the model is
sensitive to their values and are used for data assimilation

| Process | Parameter | Value | Range | Unit | Description | References |
|---|---|---|---|---|---|---|
| CH₄ production | r_me | 0.65 | (0.0, 0.7) | - | Potential ratio of anaerobically mineralized C released as CH₄ | Zhuang et al. (2004), Segers (1998), Zhu et al. (2014) |
| | Q₁₀_pro | [illegible] | (0.0, 10) | - | Q₁₀ for CH₄ production | Walter and Heimann (2000) |
| | T_opt_pro | 20.0 | - | °C | Optimum temperature for CH₄ production | Wilson et al. (2016) |
| CH₄ oxidation | K_CH₄ | 5.0 | - | µmol L⁻¹ | Michaelis-Menten coefficient | Walter and Heimann (2000), Zhang et al. (2002) |
| | O_max | 15.0 | (3.0, 45.0) | µmol L⁻¹ h⁻¹ | Maximum oxidation rate | Zhuang et al. (2004) |
| | Q₁₀_oxi | 2.0 | - | - | Q₁₀ for CH₄ oxidation | Walter and Heimann (2000), Meng et al. (2012) |
| | T_opt_oxi | 10.0 | - | °C | Optimum temperature for CH₄ oxidation | Zhuang et al. (2004) |
| CH₄ diffusion | dr | 0.66 | - | - | Tortuosity coefficient | Walter and Heimann (2000) |
| | D_air | 0.2 | - | cm² s⁻¹ | Molecular diffusion coefficient of CH₄ in air | Walter and Heimann (2000) |
| | D_water | 0.00002 | - | cm² s⁻¹ | Molecular diffusion coefficient of CH₄ in water | Walter and Heimann (2000) |
| CH₄ ebullition | [CH₄]_thre | 750 | - | µmol L⁻¹ | CH₄ concentration threshold above which ebullition occurs | Walter and Heimann (2000), Zhu et al. (2014) |
| Plant-mediated transportation | T_veg | 0.7 | (0.01, 15.0) | - | Factor of transport ability at plant community level | Walter (1998), Zhuang et al. (2004) |

Reproduced from Ma et al. (2017).
Read in observation data: mean (Z(t)) and standard
deviation (σ(t))

do i = 1, 50000
    o Generate a new parameter set
    o Run model to get simulated CH₄ emission, X(t)
    o Cost function (compare X(t) to Z(t))
    o Accept/reject the new parameter set
    o Save all the accepted parameter sets
enddo

Randomly save 500 sets of model output generated
by the accepted parameter sets

Figure 26.2. Logic flow of the TECO-CH₄ data assimilation framework.


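Figure 26.3. A diagram showing the trajectory of the updated model-data mismatch (J_new) on the MCMC chain. Blue is J_new at the (n - 1)th
iteration, the orange dot is one of the local minimum mismatch points (J_new at the nth iteration), red is the next accepted J_new at the (n + 1)th iteration,
and green is the global minimum mismatch (J_new).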
criteria also allow J_new to be accepted with a very
small probability when J_new > J_old, so that the
chain gets a chance to leave a local minimum
mismatch point and reach the global minimum
mismatch. See Chapter 22 for a further discussion
of acceptance criteria in the MCMC.
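
The accept/reject logic of Step 5 can be sketched as a plain random-walk Metropolis sampler. This is a simplified stand-in, not the adaptive algorithm in the TECO framework: the proposal is a fixed-width uniform step, proposals outside the prior range are rejected, the chain records the current state at every iteration, and `toy_cost` is a hypothetical one-parameter cost with its minimum at 3.0.

```python
import math
import random


def metropolis(cost_fn, start, ranges, n_iter=5000, step_frac=0.1, seed=0):
    """Random-walk Metropolis over a uniform prior box.

    cost_fn : maps a parameter dict to the mismatch J
    ranges  : {name: (low, high)} prior bounds
    Returns the chain of states, one entry per iteration.
    """
    rng = random.Random(seed)
    current = dict(start)
    j_old = 9_000_000.0  # large initial value so the first proposal is accepted
    chain = []
    for _ in range(n_iter):
        proposal = {
            name: current[name] + rng.uniform(-1.0, 1.0) * step_frac * (hi - lo)
            for name, (lo, hi) in ranges.items()
        }
        inside = all(lo <= proposal[n] <= hi for n, (lo, hi) in ranges.items())
        if inside:
            j_new = cost_fn(proposal)
            # Always accept an improvement; otherwise accept with
            # probability exp(-(j_new - j_old)) so the chain can escape
            # local minima.
            if j_new < j_old or rng.random() < math.exp(j_old - j_new):
                current, j_old = proposal, j_new
        chain.append(dict(current))
    return chain


def toy_cost(params):
    """Hypothetical cost: Gaussian mismatch centred on a = 3.0, sigma = 0.5."""
    return (params["a"] - 3.0) ** 2 / (2.0 * 0.5 ** 2)


chain = metropolis(toy_cost, {"a": 8.0}, {"a": (0.0, 10.0)})
posterior = [state["a"] for state in chain[1000:]]  # discard the warm-up period
```

Discarding the first 1000 iterations plays the role of the warm-up period; in practice the warm-up length is chosen by inspecting convergence rather than fixed in advance.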


Step 6: Estimating Parameters. 


The posterior parameter distributions
achieved by MCMC reveal that both of the CH₄
production related parameters (the potential
ratio of anaerobically mineralized C released as
CH₄, and the temperature sensitivity of CH₄
production) are well constrained (Figure 26.4).
By applying a linearized Q₁₀ function to measured
CH₄ emission fluxes, Gill et al. (2017)
estimated the mean value of CH₄ flux Q₁₀ to be
5.63 (95% confidence interval 2.92-10.52)
at the same study site during the 2015
growing season. Our constrained Q₁₀ range
is 2.34-6.33 (95% confidence interval),
which overlaps with but is narrower
than the estimate by Gill et al. (2017). The
other two parameters, maximum oxidation
rate and factor of plant transport ability at
community level, are not well constrained by the
data. A longer or denser record of observation
data and extra datasets such as peat CH₄
concentration are likely to be helpful for
constraining these parameters.

Figure 26.4. Posterior distributions of parameters from 50,000 samples of the M-H simulation. (a) Potential ratio of anaerobically
mineralized carbon released as CH₄; (b) Q₁₀ for CH₄ production; (c) maximum oxidation rate; (d) factor of transport ability at
plant community level. Reproduced from Ma et al. (2017).
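
How well a parameter is constrained can be summarized by comparing the width of its posterior interval with the width of its prior range. The sketch below uses the Q₁₀ numbers quoted in Step 6; the function names and the simple percentile rule are assumptions for illustration, not the method used in the study.

```python
import math


def percentile_interval(samples, level=0.95):
    """Central interval containing about `level` of the posterior samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return lower, upper


def constraint_ratio(post_interval, prior_range):
    """Posterior interval width as a fraction of the prior range width.
    Values well below 1 indicate a well-constrained parameter."""
    return (post_interval[1] - post_interval[0]) / (prior_range[1] - prior_range[0])


def overlap(a, b):
    """Intersection of two intervals, or None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


q10_posterior = (2.34, 6.33)   # constrained range from Step 6
q10_prior = (0.0, 10.0)        # prior range in Table 26.1
q10_gill = (2.92, 10.52)       # interval reported by Gill et al. (2017)

ratio = constraint_ratio(q10_posterior, q10_prior)
shared = overlap(q10_posterior, q10_gill)
```

With these numbers the posterior interval spans about 40% of the prior range, and the two Q₁₀ intervals share the span from 2.92 to 6.33.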


Step 7: Generating Methane Predictions 


To quantify model uncertainty due to parameter
values, we can randomly draw sets of
parameters from the posterior distribution
and run the model with each, resulting in a
distribution of outputs for CH₄ flux. Here, we
will perform 500 simulations with different,
randomly drawn, parameter sets, with forcing
consisting of stochastically generated environmental
variables (2017-2024) based on historical
meteorology data. Results including a
baseline historical simulation (2011-2016) are
shown in Figure 26.5. It is apparent that the
data-constrained TECO-CH₄ simulations match
both the magnitude and seasonal variations
of CH₄ emission reasonably well for both the
2011-2014 and the 2015-2016 periods.
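
The ensemble procedure described above (draw parameter sets from the posterior, run the model with each, summarize the spread) can be sketched as follows. The model here is a hypothetical two-parameter stand-in, not TECO-CH₄, and the posterior sets and forcing values are invented for illustration.

```python
import random
from statistics import mean, pstdev


def toy_model(params, soil_temps):
    """Hypothetical flux model: base rate scaled by a Q10 response
    relative to 10 deg C. Stands in for a full model run."""
    return [params["r"] * params["q10"] ** ((t - 10.0) / 10.0) for t in soil_temps]


def ensemble_band(model, posterior, forcing, n_draws=500, seed=1):
    """Run the model with n_draws parameter sets drawn at random from the
    posterior sample; return the per-time-step mean and standard deviation."""
    rng = random.Random(seed)
    runs = [model(rng.choice(posterior), forcing) for _ in range(n_draws)]
    means = [mean(step) for step in zip(*runs)]
    sds = [pstdev(step) for step in zip(*runs)]
    return means, sds


# Two hypothetical posterior parameter sets and three forcing values.
posterior_sets = [{"r": 1.0, "q10": 2.0}, {"r": 1.0, "q10": 4.0}]
means, sds = ensemble_band(toy_model, posterior_sets, [10.0, 20.0, 5.0])
```

The mean and the one-standard-deviation band returned here correspond to the line and shaded area in a plot such as Figure 26.5.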


The historical part of the simulation reveals the
prediction accuracy of the model, and its sensitivity
to environmental forcing. The model tracks
the interannual variability of the measurements,
notably including the emission spike in 2016. By
comparing the seasonal variation of CH₄ emission
to environmental drivers, we find that wetland CH₄
emission is dominated by surface inundation and
soil temperature. Soil temperature is the limiting
factor below 10°C, but the water table level
controls peak growing season CH₄ emission when soil
temperature is well within the active range for methane
processes. CH₄ emission is more sensitive to soil
temperature during wet periods when the whole
soil is inundated.

Data-constrained parameter probability distributions
are then used to predict the CH₄ response
to warming. Increases of +2.25°C, +4.5°C,
+6.75°C and +9°C in both air and soil temperature
are added to drive the TECO-CH₄ model (a
set of warming scenarios corresponding roughly
to the experimental warming treatments at the
SPRUCE site). We find an exponential increase of
CH₄ emission in response to warming, with a
fourfold increase under +9°C warming (Figure 26.6,
panel a). The uncertainty in plant-mediated transport
and ebullition increases most under warming
and contributes to the overall change in CH₄
emission uncertainty (panels d-f).
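
An exponential temperature response is what a Q₁₀ formulation produces on its own. As a back-of-envelope check, the sketch below applies the standard Q₁₀ scaling to the warming levels listed above, using the two ends of the constrained Q₁₀ range from Step 6. It covers production only and ignores oxidation, transport, and water table effects, so it is not the TECO-CH₄ prediction; the function name is hypothetical.

```python
def q10_factor(q10, delta_t):
    """Multiplicative change in a rate for a warming of delta_t (deg C),
    using the standard Q10 scaling: rate ratio = q10 ** (delta_t / 10)."""
    return q10 ** (delta_t / 10.0)


warming_levels = [0.0, 2.25, 4.5, 6.75, 9.0]

# Scaling factors at each warming level for the lower and upper ends
# of the constrained Q10 range (2.34 and 6.33).
factors = {
    q10: [q10_factor(q10, dt) for dt in warming_levels]
    for q10 in (2.34, 6.33)
}
```

Under these assumptions a +9°C warming multiplies the rate by about 2.1 at the lower end of the range and about 5.3 at the upper end, which brackets the fourfold increase reported for the full model.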

In summary, this chapter shows how data 
assimilation is applied to reduce the uncertainty of 
modeled methane emission in a northern wetland 
ecosystem. The data assimilation approach used in 
this case study enabled parameter estimation and 
uncertainty quantification for forecasting methane 
fluxes in response to climate change. 
Figure 26.5. Simulation of CH₄ emission dynamics based on actual (2011-2016) and stochastically generated (2017-2024)
weather forcing data. Green dots refer to observations from 2011 to 2014, which are used for data assimilation. Blue dots
indicate observations from 2015 to 2016, which are used for model validation, and error bars indicate the standard deviation of
each observation. The red line is the simulated mean methane emission. The shaded area corresponds to 1 standard deviation based
on 500 model simulations with parameters randomly drawn from the posterior distribution. (a-c) The 2011 daily variation of
water table, surface soil temperature, and methane emission. Reproduced from Ma et al. (2017).
Figure 26.6. Responses of annual CH₄ emission to warming and elevated CO₂ (eCO₂). Red lines indicate CH₄ fluxes under
warming treatments and 380 ppm CO₂; blue lines indicate CH₄ fluxes under warming treatments and 880 ppm CO₂. X-axes
indicate the warming treatments of +0°C, +2.25°C, +4.5°C, +6.75°C and +9°C above ambient level. Shaded areas correspond
to the mean ± one standard deviation based on 500 randomly chosen model simulations with parameters drawn from the posterior
distribution. Reproduced from Ma et al. (2017).
QUIZZES


1. Give two examples of observation data streams
that could be used to constrain a methane model.


2. What are the main sources of uncertainty in
wetland methane emission in terrestrial ecosystem
models?

A. Model structure
B. Parameter values
C. Wetland extent/inundation map
D. All the above


3. Could you still perform data assimilation if your
observation data had gaps?


4. Is the CH₄ emission data able to constrain all the
parameters in this study?


SHUANG MA
The goal of this chapter is to explore the potential
for data to support diagnostics and forecasting of 
the terrestrial carbon cycle via model-data fusion. 
This understanding will be built by explaining and 
exploring an existing framework, CARDAMOM, 
which is linked to an intermediate complexity 
model, DALEC. The key learnings will include the 
concept of ecological and dynamical constraints, 
the potential to generate emergent maps of key 
parameters, the role of observational error, and 
the key avenues for future research using earth 
observations. 


INTRODUCTION 
Challenges for Modeling 


We begin with the premise that all models are 
wrong, but some are useful (Chapter 2). Models are 
wrong because none fully describes the simulated
system, being instead a simplified and incomplete 
representation. Models are constructed based on a 
series of hypotheses about the target system, and 
these hypotheses and their connections determine 
the model's structure. The structure represents the 
interacting components of a system (its state vari- 
ables), and describes the processes that determine 
system evolution (changes to state variables). We 
recognize that the underlying hypotheses may be 
incorrect, over-simplified and incomplete, to vary- 
ing degrees. 

A core requirement for modeling is data, for cal- 
ibration, validation, and forcing. Data may quantify 
exogenous forcing factors such as weather, which 
affect the rates of processes, such as photosynthe- 
sis, or may input stochastic adjustment such as dis- 
turbance, which can disrupt state variables. Data 
also support calibration of process parameters and 
the setting of initial conditions for state variables.
For instance, measurements of leaf photosynthesis 
under varied conditions can be used to calibrate 
the rate parameters for electron transport and car- 
boxylation. The data required for drivers and for 
calibration will be incomplete as not every process 
is measurable. This incompleteness leads to poorly 
determined parameters and missing forcing data. 
Available data will have errors, systematic or ran- 
dom. It is the interaction of hypothesis/structural 
error (e.g., missing processes) and data error/gaps 
that causes models to be wrong. 

Despite these challenges, models are useful 
for testing theoretical understanding and provid- 
ing practical support for management and deci- 
sion making. System models, the focus of this 
book, are particularly useful for understanding 
feedbacks between processes and state variables. 
These feedbacks occur over a variety of scales of 
time and space. For example, stomatal conductance
responds to atmospheric conditions, which
change on the scale of minutes to hours, and also 
responds to soil moisture, which changes on the 
scale of days to weeks. Soil moisture responds to 
external, stochastic factors such as precipitation, 
but also to the activity and penetration of plant 
roots, which might vary over months and years, 
and water demand from the total leaf area and its 
stomatal opening. Only process-based models can
allow these complex hypothesized linkages to be 
made and explored. Models provide a means to 
diagnose and understand what controls changes in 
the system, identifying key timescales, feedbacks, 
and interactions. Models are capable of interpola- 
tion in space and time, and therefore of making 
forecasts for the components of the system that are 
represented. 


Model Complexity 


Forest carbon (C) models are structured on a vari- 
ety of different hypotheses and levels of complexity. 
For instance, some simulate the dynamics of indi- 
vidual trees, others of cohorts of different ages, and 
some of pools of C in live and dead organic matter. 
These different representations of the forest system 
trade off realism versus simplicity. Modeling indi- 
vidual trees is more realistic, including the poten- 
tial to simulate competition among them and to 
model adjustments to stand microclimate, which are
known from observation and experiment to feed 
back to growth processes and therefore system 


dynamics. But this realism requires more hypoth- 
eses, for instance on competition, and therefore 
results in more model complexity. Complexity 
generates more potential connections to observa- 
tions, with increased requirements for drivers and 
for calibration data. If these demands for data can- 
not be met then additional complexity may result 
in at best a poor characterisation of model error 
and at worst a heavily biased model. Complex 
models tend to have more parameters, and this 
extended demand for parameterization becomes 
increasingly challenging to meet. A key challenge 
in model construction is to determine the appro- 
priate and justifiable complexity. 

Process rates in models are determined by influ- 
ences (either internal from associated state variables 
or external from drivers such as weather), functional 
forms, and parameters. Internal influences gener- 
ate interactive control, whereby the magnitude of 
a state variable affects processes that influence that 
or other state variables. Thus, the magnitude of leaf 
area influences the magnitude of photosynthesis 
and evapotranspiration in many models. In more 
complex models the leaf area of individual stems 
may affect the light conditions and therefore pho- 
tosynthesis of other stems, influencing competi- 
tion. External influences in carbon cycle models 
can include weather, management, or disturbance. 
For photosynthesis, downwelling solar radiation 
is a key control, provided as a driving variable 
for the model at the appropriate temporal reso- 
lution and specified units. Parameters determine 
how the magnitude of state variables and driv- 
ing variables are connected to the magnitude of 
the process modeled. For instance, photosynthesis 
(gC m⁻² d⁻¹) might be estimated in a simple linear
model by multiplying the light energy absorbed
by a canopy (MJ m⁻² d⁻¹) by a calibrated light use
efficiency parameter (LUE, gC MJ⁻¹). Functional
forms determine how inputs and parameters are 
connected algebraically, for example, determining 
whether the response of the process is linear, satu- 
rating or exponential. 
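
The light use efficiency example and the three functional forms named above can be written out directly. The function names and the numerical values (an LUE of 1.5 gC MJ⁻¹ and 8 MJ m⁻² d⁻¹ of absorbed light) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def gpp_lue(apar, lue):
    """Photosynthesis (gC m-2 d-1) from absorbed light (MJ m-2 d-1)
    and a calibrated light use efficiency (gC MJ-1)."""
    return lue * apar


def linear(x, slope):
    """Response proportional to the input."""
    return slope * x


def saturating(x, vmax, k):
    """Response that levels off towards vmax; equals vmax / 2 at x = k."""
    return vmax * x / (k + x)


def exponential(x, a, b):
    """Response that grows by a constant factor per unit of input."""
    return a * math.exp(b * x)


# 8 MJ m-2 d-1 absorbed at 1.5 gC MJ-1 gives 12 gC m-2 d-1.
gpp = gpp_lue(apar=8.0, lue=1.5)
```

The same inputs and parameters give very different behavior at the extremes of the input range, which is why the choice of functional form matters as much as the parameter values when a model is used to extrapolate.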

Models strive to be realistic, general and accu- 
rate. But there are complex trade-offs required in 
seeking these goals, particularly the need for mod- 
els to be tractable, and so simple. There is ongo- 
ing debate about whether realistic (i.e. including 
all known processes) or general (i.e. globally 
applicable) models are necessarily more accurate. 
It is more practical for globally-applied models 
to be simpler, as a simple structure means leaner 
requirements for parameters and drivers. Mapping
parameter variation globally is less demanding for 
simple models because they have fewer parame- 
ters, and these are usually easier and/or possible 
to obtain from the literature. For example, it is 
simpler to parameterize a LUE model for photo- 
synthesis (as above), with its single parameter for 
calibration for each biome or land cover type, than 
to parameterize typical photosynthesis models in 
land surface schemes that might have >10 param- 
eters for the same process. The potential advantage 
of the more complex model is that it is realistic, 
combining all the dominant controls on a process, 
rather than just a limited selection as in the simple 
model. Realism introduces more detailed pro- 
cesses and extra state variables, and makes more 
connections between them. Additional complexity 
adds parameters and often requires more complex 
functional forms. However, data sparsity can mean 
that parameters and functional forms are not well 
determined. In this case the complex model may be 
less accurate, particularly in making forecasts. With 
large numbers of parameters there is a risk of over- 
fitting, i.e. the complex model can be calibrated to 
match noise rather than signal in the data, given 
its many adjustable parameters. A complex model 
fitted to available data may make poorer forecasts 
than a simple model if the calibration data are 
biased or its errors are poorly quantified. 


Model Error 


While models are imperfect, they can be useful 
if their error can be quantified and understood. 
These errors can be determined through calibra- 
tion and validation against observational data with 
robust error statistics. Validation allows testing for 
over-fitting. Observational error allows the mod- 
eler to weight the importance of data for calibra- 
tion, and avoids over-fitting. Observational error 
can be incorporated into the model forecasts by 
propagating the error through the calibration pro- 
cess into parameter uncertainty. Once parameter 
uncertainty is quantified and included in model 
analyses and forecasts, models can become useful 
tools for generating understanding, constraining 
prediction, and supporting management and con- 
trol. Alternate model structures, based on varied 
hypotheses, can be compared to understand the 
error associated with the model structure. Model 
structural error can be compared to error from the 
parameterization process. 


Data provide objective and independent mea- 
sures of the system of interest and the basis for 
evaluating and developing the hypotheses that 
create models. While models can provide useful 
theoretical tools for developing ecological think- 
ing, for practical purposes related to diagnosing 
and forecasting global change effects it is vital that 
models are calibrated and validated using obser- 
vations of key state variables and their controlling 
factors. The value of observational data is enhanced 
by clear description of their confidence intervals, 
which allows modelers to weight their impor- 
tance robustly. Observational data, particularly 
time series, have exceptional value for evaluating 
understanding of dynamical systems. But the com- 
plexity and expense of collecting these data means 
that data replication is difficult and so uncertainty 
quantification is a challenge. 


Data-Model Integration 


Combining models with data has a long his- 
tory in ecological science. Detailed studies of 
the photosynthetic process provided insights 
into the functional forms of the reactions and 
the critical parameters that led to robust mod- 
els. Photosynthetic measurements at canopy scale 
were used to parameterize simple response func- 
tions by minimizing the sum of square differences 
between observation and model. Soil respiration 
models were derived empirically from large sam- 
ples of respiration data under varying soil condi- 
tions. These data helped produce response models 
for key processes, but further work was required 
to produce system models with feedbacks. 
Combining ecophysiological process models (e.g., 
photosynthesis or respiration) with simulation of 
state variables that provide inputs to and respond 
to these processes has proved more challenging. 
Photosynthesis is a function of CO2 concentration
within the leaf, leaf area, leaf enzyme content,
temperature, and irradiance. Leaf area is an ecolog- 
ical variable that is itself determined by allocation 
from C fixed by photosynthesis and phenological 
factors such as temperature, photoperiod, interac- 
tion with labile or nonstructural C stores and so 
on. Photosynthesis varies on time scales of seconds 
to hours while leaf area varies on time scales of 
weeks, so the interaction between ecophysiology 
and phenology is complex. Forecasting future
photosynthesis requires coupling a model of
photosynthesis and a model of phenology to determine
their interactions and codevelopment. Time series data
are required for both these processes to support
coupled model calibration.
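The least-squares calibration of a canopy response function mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular study: the saturating light response, the parameter names `p_max` and `k_half`, the synthetic observations, and the brute-force grid search are all hypothetical choices made so that the example is self-contained.

```python
def light_response(irradiance, p_max, k_half):
    """Hypothetical saturating response of canopy C uptake to irradiance."""
    return p_max * irradiance / (irradiance + k_half)


def sum_of_squares(p_max, k_half, irradiance_obs, uptake_obs):
    """Sum of squared differences between observation and model."""
    return sum(
        (light_response(i, p_max, k_half) - u) ** 2
        for i, u in zip(irradiance_obs, uptake_obs)
    )


def grid_search(irradiance_obs, uptake_obs, p_max_grid, k_half_grid):
    """Return (cost, p_max, k_half) for the grid point with the lowest cost."""
    best = None
    for p_max in p_max_grid:
        for k_half in k_half_grid:
            cost = sum_of_squares(p_max, k_half, irradiance_obs, uptake_obs)
            if best is None or cost < best[0]:
                best = (cost, p_max, k_half)
    return best


# Synthetic "observations" generated from known parameters (illustrative only).
irradiance_obs = [50.0, 100.0, 200.0, 400.0, 800.0, 1600.0]
uptake_obs = [light_response(i, 20.0, 300.0) for i in irradiance_obs]

p_max_grid = [10.0 + 0.5 * n for n in range(41)]     # 10 to 30
k_half_grid = [100.0 + 10.0 * n for n in range(41)]  # 100 to 500
best_cost, best_p_max, best_k_half = grid_search(
    irradiance_obs, uptake_obs, p_max_grid, k_half_grid
)
```

Because the synthetic data are noise-free, the search recovers the generating parameters. With real measurements the minimum cost would be greater than zero, and a single best-fit parameter set says nothing about parameter uncertainty, which is the gap that the Bayesian approach described later in this chapter aims to fill.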

Eddy covariance (EC) data have revolutionized
the modeling of ecosystem C cycling by providing
long time series of net CO2 fluxes, the outcome
of photosynthesis, phenology and respiration. As
time series have extended, EC data have allowed
evaluation of model-simulated diurnal and seasonal
cycles in ecophysiology, and in some cases
of succession and disturbance effects on C cycling.
However, characterizing the uncertainty on eddy
covariance measurements of net CO2 exchange has
proved very challenging and is an ongoing area
of study. Converting high-frequency measurements
of wind velocity and CO2 concentration
into net CO2 fluxes itself relies on a model that
makes assumptions about atmospheric processes
on a range of temporal and spatial scales that are
highly dynamic. It is rare to have colocated eddy
covariance systems to assess instrumental error.
Understanding of errors arising from the eddy
covariance assumptions is still developing, particularly
biases that may arise due to averaging time scales
and complexity of terrain. Maximizing the information
content of these hard-won data for model
improvement requires a renewed focus on error
characterization, particularly in tropical systems
where EC data are rare.
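The core of the conversion from high-frequency measurements to a flux can be sketched as the covariance between fluctuations in vertical wind velocity and CO2 concentration over an averaging period. The sketch below is deliberately bare and assumes made-up sample values; real EC processing adds steps such as coordinate rotation, despiking and density corrections, none of which are shown, and the function name `eddy_flux` is hypothetical.

```python
def eddy_flux(vertical_wind, concentration):
    """Mean product of the fluctuations of two series about their means."""
    n = len(vertical_wind)
    w_mean = sum(vertical_wind) / n
    c_mean = sum(concentration) / n
    return sum(
        (w - w_mean) * (c - c_mean)
        for w, c in zip(vertical_wind, concentration)
    ) / n


# Made-up samples within one averaging period (illustrative only).
w = [0.2, -0.1, 0.3, -0.4, 0.1, -0.1]     # vertical wind, m s-1
c = [15.9, 16.1, 15.8, 16.3, 16.0, 16.1]  # CO2 density, mmol m-3

flux = eddy_flux(w, c)  # mmol m-2 s-1; negative means net uptake by the surface
```

In this example updrafts carry air that is depleted in CO2 and downdrafts carry enriched air, so the covariance is negative and the surface is a net sink over the period. The choice of averaging period, which the text identifies as a source of bias, enters through the means that define the fluctuations.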

We have seen that for creating robust and use- 
ful ecosystem models the fusion of model and data 
must provide information on multiple processes 
operating over a range of timescales, from physio- 
logical to phenological. Data are particularly sparse 
for processes operating at longer time scales, such 
as the evolution of the large pools of C in wood and 
soil. Information for understanding these relatively 
slow changing state variables is limited. The signal 
for change in these pools is difficult to extract from 
eddy covariance data, which are dominated by the 
large gross fluxes of photosynthesis and respiration, 
rather than the slower rates of C changes to soil and 
wood pools. But it is the change in the large pools 
that ultimately determines whether ecosystems are 
C sources or sinks. So, there are open questions on 
how processes that govern these large pools of C 
can be calibrated effectively. Forest inventory data 
provide useful information. But, like EC, high qual- 
ity inventory data, particularly for soils, are rare and 
sparse, concentrated in a few locations globally. Soil 
and forest inventories rarely coincide. For global
application, models need to be calibrated and
validated with relevant data that describe the complex
heterogeneity of our planet and its dynamics
over annual to decadal time scales.

Earth observation (EO) data are now providing 
vast and expanding data sources for the terrestrial 
carbon cycle that offer exciting new opportuni- 
ties for model calibration and validation. Leaf area 
index, a key auxiliary state variable related to pho- 
tosynthesis, has had global and continuous prod- 
ucts since ~2000 from optical satellites operated by 
NASA (MODIS) and ESA (European Space Agency 
Copernicus data sets), with new ESA Sentinel prod- 
ucts promising even further refinement. However, 
while LAI products are now delivered at finer 
resolutions in space and time, the products have 
poorly quantified biases. Optical satellites also 
provide products related to burned area and to 
deforestation. Meanwhile radar and lidar satellites 
are providing intermittent regional maps of aboveground
biomass. The errors on these products are
also poorly characterized and a subject of research. It
is critical that EO products are calibrated and vali- 
dated against detailed field data at relevant scales. 
But the strengths of these EO data are compelling, 
providing insights into ecological variables across 
the globe and their changes over time. If EO data 
can be used to calibrate and validate models there 
is the potential to test the generality of models at 
a new level of detail and across biomes and land- 
scapes. Thus, a major research effort has been con- 
ducted to build approaches for model calibration 
and validation that use EO data. The rest of this 
chapter describes one of these efforts. 


CARDAMOM AND DALEC—AN EXAMPLE 
FRAMEWORK FOR C CYCLE DIAGNOSTICS 


CARbon DAta MOdel fraMework (CARDAMOM) 
is an approach constructed to quantify the inter- 
action between C model structure and data, with 
a focus on earth observations and local to global 
application. CARDAMOM is designed to produce 
a probabilistic model calibration based on local 
observations, sensitive to their number, type, 
and error. This section aims to demonstrate the 
potential of model-data fusion to generate under- 
standing of the terrestrial carbon cycle through 
evaluating the hypotheses contained in models 
and to diagnose C cycling and its uncertainty. In 
this example, CARDAMOM calibrates the DALEC 
ecosystem C model. The text first describes the
DALEC model, then the CARDAMOM framework,
and finally provides an example application, and
some analysis of its results.


The Data Assimilation Linked Ecosystem Carbon 
(DALEC) Model 


DALEC is a pool-based, mass balance model of 
the terrestrial C cycle, of intermediate complexity. 
The model runs on daily to monthly time steps, 
and can be applied across a broad range of spatial 
scales (from 1 ha to ~100 x 100 km). The ver- 
sion of DALEC described here has 17 parameters 
and six pools (Figure 27.1). DALEC's six state vari- 
ables include live and dead pools of organic C. Live 
pools represent the vegetation as foliage, wood, 
and fine root C stocks, and a labile pool of C that 
flushes leaves at the start of the growing season. 
Dead pools represent litter and soil organic matter 
C stocks. These pools evolve according to inputs to 
and losses of C from each, which are determined 
by a range of processes. Inputs of C to the sys- 
tem are determined by photosynthesis, which is 
a function of weather variables (drivers), leaf area 
index and leaf traits. Leaf area index is determined 
from the foliar C pool and the leaf trait of leaf C 
mass per area (LCMA). The foliar N parameter
describes the potential of a unit leaf area to fix C,
and is another important leaf trait. DALEC uses 
the aggregated canopy model (ACM) to simulate 
photosynthesis. ACM is a response surface model 
which relates climate, soil, and atmospheric driv- 
ers to internal variables to predict gross primary 
production (GPP). ACM is a simplified version of 
the process-based Soil-Plant-Atmosphere (SPA) 
model. The advantage of ACM over SPA is that it 
operates at daily time steps, and so is orders of 
magnitude faster than sub-daily SPA, while still 
capturing the sensitivity of photosynthesis to key 
ecological and physical variables. 

Once GPP is determined, DALEC apportions
a fraction of this to autotrophic respiration (Ra),
leaving the remainder for net primary production,
NPP. A simplified model of Ra, requiring
a single parameter, is selected due to the lack of
a robust process model; this is a key area for
future research. NPP is then allocated to all
live plant pools. A phenological model determines how
allocation to the labile pool and foliage varies over 
annual cycles, and how labile C is used to flush 
leaves, requiring parameters to determine the start 
of leaf flush and of leaf senescence, and their duration.
The remaining C is then allocated at fixed
fractions to fine roots and wood. Wood and fine
root C pools have a parameter that determines
their lifespans, the inverse of turnover rate. The
generation of plant litter is a first order process
determined by the mass of C in each of the foliage,
root and wood pools and its turnover rate.

Figure 27.1. The DALEC model structure. Green boxes are live and dead pools of C. SOM is soil organic matter. The
labile C pool represents stored C which supports leaf flush at the start of the growing season. Green lines show C fluxes,
identified by the text in the blue boxes. GPP is gross primary production. NPP is net primary production. R is respiration, either
heterotrophic (h) or autotrophic (a). Fire and management can remove C from any of the pools by combustion or harvest, with
prescribed loss fractions. Climate influences rates of GPP and Rh, and in some versions of DALEC also influences the fluxes into
and out of the labile pool. The mass of foliar C determines canopy leaf area index, which is a determinant of GPP; this influence
is shown by the dashed blue arrow and generates feedback within the system.

Plant litter from foliage and fine roots enters 
the litter pools, and litter from wood enters the 
soil organic matter (SOM) pool. Both these pools 
have turnover rate parameters that determine the 
mineralisation of these pools to CO2. This turnover
is also sensitive to air temperature, according to an 
exponential function with a calibrated parameter. 
Litter C also has a decomposition rate parameter 
that converts its C to SOM. Heterotrophic respira- 
tion is the sum of mineralisation from both litter 
and SOM pools. 
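The pool structure described above can be sketched as a single daily update of six pools. This is a toy written for illustration and is not the DALEC code: GPP is prescribed rather than computed by ACM, the phenology model is replaced by a constant labile release rate, temperature sensitivity is applied only to mineralisation, and every parameter name and value in `PARAMS` is hypothetical.

```python
import math

# Hypothetical parameter values, chosen only so that the example runs.
PARAMS = {
    "f_auto": 0.45,    # fraction of GPP respired autotrophically
    "f_fol": 0.25,     # fraction of NPP allocated to foliage
    "f_lab": 0.10,     # fraction of NPP allocated to the labile pool
    "f_root": 0.50,    # fraction of the remaining NPP allocated to fine roots
    "t_lab": 0.01,     # labile release to foliage (per day; constant here)
    "t_fol": 0.005,    # foliage turnover (per day)
    "t_root": 0.003,   # fine root turnover (per day)
    "t_wood": 0.0001,  # wood turnover (per day)
    "t_lit": 0.002,    # litter mineralisation (per day at 0 deg C)
    "t_som": 0.00002,  # SOM mineralisation (per day at 0 deg C)
    "d_lit": 0.001,    # litter decomposition to SOM (per day)
    "theta": 0.07,     # exponential temperature sensitivity (per deg C)
}


def daily_step(pools, gpp, air_temp, p=PARAMS):
    """Advance the six C pools by one day; return (new_pools, fluxes)."""
    ra = p["f_auto"] * gpp
    npp = gpp - ra
    to_fol = p["f_fol"] * npp
    to_lab = p["f_lab"] * npp
    to_root = p["f_root"] * (npp - to_fol - to_lab)
    to_wood = npp - to_fol - to_lab - to_root

    temp_factor = math.exp(p["theta"] * air_temp)
    lab_release = p["t_lab"] * pools["labile"]
    fol_litter = p["t_fol"] * pools["foliage"]    # first-order turnover
    root_litter = p["t_root"] * pools["roots"]
    wood_litter = p["t_wood"] * pools["wood"]
    rh_litter = p["t_lit"] * temp_factor * pools["litter"]
    rh_som = p["t_som"] * temp_factor * pools["som"]
    decomposition = p["d_lit"] * pools["litter"]  # litter C converted to SOM

    new_pools = {
        "labile": pools["labile"] + to_lab - lab_release,
        "foliage": pools["foliage"] + to_fol + lab_release - fol_litter,
        "roots": pools["roots"] + to_root - root_litter,
        "wood": pools["wood"] + to_wood - wood_litter,
        "litter": pools["litter"] + fol_litter + root_litter
        - rh_litter - decomposition,
        "som": pools["som"] + wood_litter + decomposition - rh_som,
    }
    rh = rh_litter + rh_som
    fluxes = {"gpp": gpp, "ra": ra, "npp": npp, "rh": rh, "nee": ra + rh - gpp}
    return new_pools, fluxes


# Illustrative initial stocks (gC m-2) and one day of forcing.
pools0 = {"labile": 20.0, "foliage": 100.0, "roots": 100.0,
          "wood": 5000.0, "litter": 200.0, "som": 10000.0}
pools1, fluxes1 = daily_step(pools0, gpp=5.0, air_temp=15.0)
```

A useful property to check in any pool model of this kind is mass balance: the change in total stock over the step should equal GPP minus the two respiration fluxes. The parameters plus the six initial pool sizes are the unknowns that a calibration framework has to estimate.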

Disturbances such as fire can be imposed on the 
C stores in DALEC, so that if the drivers indicate 
such an event, a fraction of C in live and dead pools 
is removed from the ecosystem. Foliage, wood, 
and litter tend to have higher fractional losses from 
fire than roots and SOM. Harvest effects can also be 
imposed, again with pool-specific parameters that 
determine what fraction is removed and what is 
added to litter and SOM. In this application these 
disturbance parameters are derived from the litera- 
ture and are not calibrated. 


The Carbon Data Model Framework (CARDAMOM) 


DALEC here has 23 unknowns. As each param- 
eter and the initial condition of each pool must 
be known for the model to be run, CARDAMOM 
is used to calibrate these model parameters using 
available information at the location simulated 
(Figure 27.2). Calibrating a 23-dimensional 
hypervolume is highly challenging computation- 
ally. One approach to calibration involves differen- 
tiating the model code to determine the sensitivity 
of its outputs (predictions) to its parameters. 
This process allows the model to be inverted and 
parameters found that produce outputs consistent 
with independent measurements of those outputs. 
But in CARDAMOM we use a forward modeling 
approach. This avoids the challenging process of 
code differentiation. Instead, Markov Chain Monte 
Carlo (MCMC) methods are deployed.

Figure 27.2. A schematic of the Carbon Data Model Framework, CARDAMOM, showing the inputs to the framework and the
outputs. EDCs are ecological and dynamic constraints. The process model in this example is DALEC. The framework can be
applied at a single site or over multiple pixels using earth observations (EO) as inputs.

MCMC was introduced in Chapter 22. The
MCMC approach involves running a model many
times, with varying parameters. The approach
finds those parameter sets that generate model
outputs (e.g., LAI time series) that match independent
observations (e.g., from EO), weighted
by their errors, and keeps these sets. By running
a model millions of times, the MCMC approach
searches parameter space to find parameter sets
consistent with observations and observational
error. The approach to select or reject parameter
sets is Bayesian. That is, the approach assesses the 
likelihood that model and data set are consistent. 
This likelihood is stored with each parameter set, 
and the search process ends when likelihoods have 
converged. By running separate chains of MCMC 
runs, starting from different initial parameter 
guesses, the approach can check that different 
chains arrive at consistent likelihoods. If chains 
do not converge, the approach is judged to have 
failed. Failure to converge can occur if there are
problems with the data or their error specification,
or because of model structural errors.
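The accept/reject search described above can be sketched with a basic Metropolis sampler. This is a generic textbook variant shown for illustration, not the sampler used in CARDAMOM: the one-parameter linear "model", the observations, the error `sigma`, the proposal `step` and the burn-in length are all assumptions, and convergence is checked here simply by comparing two chains started from different initial guesses.

```python
import math
import random


def log_likelihood(a, x, obs, sigma):
    """Gaussian log-likelihood of a toy model y = a * x (unnormalized)."""
    return -0.5 * sum(((a * xi - yi) / sigma) ** 2 for xi, yi in zip(x, obs))


def metropolis(x, obs, sigma, start, bounds, n_iter, step, seed):
    """Return a chain of (parameter, log-likelihood) pairs."""
    rng = random.Random(seed)
    lower, upper = bounds
    current = start
    current_ll = log_likelihood(current, x, obs, sigma)
    chain = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        # Uniform prior: proposals outside the search range are rejected.
        if lower <= proposal <= upper:
            proposal_ll = log_likelihood(proposal, x, obs, sigma)
            accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
            if rng.random() < accept_prob:
                current, current_ll = proposal, proposal_ll
        chain.append((current, current_ll))
    return chain


# Made-up observations of a quantity that scales with x (illustrative only).
x = [1.0, 2.0, 3.0, 4.0]
obs = [2.1, 3.9, 6.2, 7.8]

# Two chains started from different initial parameter guesses.
chains = [
    metropolis(x, obs, sigma=0.5, start=start, bounds=(0.0, 5.0),
               n_iter=5000, step=0.2, seed=seed)
    for start, seed in [(0.5, 1), (4.0, 2)]
]

burn_in = 1000  # discard the early part of each chain
chain_means = [
    sum(a for a, _ in chain[burn_in:]) / (len(chain) - burn_in)
    for chain in chains
]
```

If the two chain means (or the stored likelihoods) disagree markedly, the search has not converged and the calibration should not be trusted. The real problem differs mainly in scale: 23 dimensions rather than one, and a full ecosystem model run for every proposal.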

Note that the output of this Bayesian approach 
is a set of accepted parameter sets. Parameter values 
are correlated because of the model structure, which 
connects parameterized processes via pool interac- 
tions. We can use a sample of accepted parameter 
sets to generate statistics about each parameter's 
posterior distribution. For instance, it is typical to 
report the 90% confidence interval on each param- 
eter posterior. It is also informative to inspect the 
covariance of parameters produced by MCMC 
approaches to understand process interactions 
within the model. An interesting result from the 
CARDAMOM approach when used at global scales 
was that the estimates of leaf lifespan and leaf mass 
per area were correlated. This result is consistent 
with expectations from the leaf economic spectrum 
but was emergent from the structure of DALEC and 
the climate and time series of LAI assimilated. 
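Summarizing a sample of accepted parameter sets might look like the sketch below, which computes a 90% interval for one parameter and the correlation between two. The samples are synthetic stand-ins generated for illustration, not CARDAMOM output, and the helper names `interval_90` and `pearson` are hypothetical.

```python
import random
import statistics


def interval_90(samples):
    """5th and 95th percentiles of a list of posterior samples."""
    cuts = statistics.quantiles(samples, n=20)  # 19 cut points at 5% steps
    return cuts[0], cuts[-1]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


# Synthetic "accepted parameter sets" with a built-in positive relationship.
rng = random.Random(0)
leaf_lifespan = [1.0 + 0.01 * i for i in range(101)]  # years
leaf_c_mass_per_area = [
    60.0 + 40.0 * ls + rng.gauss(0.0, 2.0) for ls in leaf_lifespan
]  # gC m-2

lifespan_low, lifespan_high = interval_90(leaf_lifespan)
trait_correlation = pearson(leaf_lifespan, leaf_c_mass_per_area)
```

Inspecting the full correlation matrix of the accepted parameter sets in the same way shows which processes the model structure ties together, and which parameters the data can only constrain in combination.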

Prior parameter ranges are needed for Bayesian 
calibration, because the calibration process must 
be informed of the search range of each parameter. 
Setting the parameter priors for MCMC calibration 
approaches is a challenging task. If the prior is 
made too wide, then the search effort is enormous 
and the MCMC approach may not find a calibra- 
tion that is accepted, as likelihoods are too low 
and do not converge. If the prior is made too nar- 
row, calibration may exclude values that are actu- 
ally realistic. This can lead to edge-hitting, where 
the posterior parameters are found to cluster at 
the upper or lower bound of the prior range. It 
is important to inspect the posterior distribution 
and compare to the prior to spot edge-hitting. In 
this case the prior range may be expanded and the 
calibration repeated. Another challenge in setting 
the prior is to inform the MCMC of the prior dis- 
tribution across the range. The simplest approach 
is to set a uniform prior with equal likelihood of 
selecting any value in the range during the MCMC
approach. But in some cases, it may make sense to
set a Gaussian (i.e., normal, or bell-shaped) prior,
which increases the likelihood of selecting some
values within this range. A Gaussian prior may be
chosen if one has independent knowledge about
the value of the parameter.
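The two kinds of prior, and a simple check for edge-hitting, can be sketched as follows. The function names, the 2% margin used to define "near a bound", and the example samples are assumptions made for illustration. Both priors are left unnormalized, which is sufficient when they are only used to compare parameter sets within one calibration.

```python
import math


def log_uniform_prior(value, lower, upper):
    """Flat prior: every value inside the range is equally likely."""
    return 0.0 if lower <= value <= upper else -math.inf


def log_gaussian_prior(value, mean, sd, lower, upper):
    """Gaussian prior truncated to the search range (unnormalized)."""
    if not lower <= value <= upper:
        return -math.inf
    return -0.5 * ((value - mean) / sd) ** 2


def edge_hitting_fraction(samples, lower, upper, margin=0.02):
    """Fraction of posterior samples lying within `margin` of either bound,
    where `margin` is expressed as a fraction of the prior range."""
    width = (upper - lower) * margin
    near_edge = [s for s in samples if s <= lower + width or s >= upper - width]
    return len(near_edge) / len(samples)


# Illustrative posterior: 30% of samples pile up just below the upper bound.
posterior = [0.99] * 30 + [0.5] * 70
fraction = edge_hitting_fraction(posterior, lower=0.0, upper=1.0)
```

A large edge-hitting fraction is a signal to widen the prior range and repeat the calibration, as described above.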

Setting observational errors is another impor- 
tant task for Bayesian approaches, because these 
determine the likelihoods of parameter sets. 
Observational data, for instance earth observation 
products, should be provided with an estimation 
of uncertainty. However, in some cases errors are 
not available, and where these errors are provided 
they may be over-confident. Thus, caution suggests 
that relatively large uncertainties should be applied. 
But if observational uncertainties are too large then 
they do not provide constraint. In the absence of
reliable observational uncertainties, we walk a fine
line between maximizing the information content
of data and overfitting to error-filled observations.
It is also advisable to apply geometric errors rather
than arithmetic errors to avoid zero crossing when
data with values close to zero are assimilated. Thus,
the uncertainty on LAI might be set at ×/÷ 1.5,
rather than ± 0.5.
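The difference between arithmetic and geometric errors can be made concrete with a low LAI value. The sketch below assumes that a geometric error is implemented as a Gaussian error on the logarithm of the observation; that is one common reading, offered here as an assumption rather than as the CARDAMOM implementation, and the function names are hypothetical.

```python
import math


def arithmetic_bounds(value, error):
    """One-error bounds for an additive (plus/minus) uncertainty."""
    return value - error, value + error


def geometric_bounds(value, factor):
    """One-error bounds for a multiplicative (times/divide) uncertainty."""
    return value / factor, value * factor


def log_likelihood_geometric(model, obs, factor):
    """Gaussian log-likelihood of the residual in log space (unnormalized)."""
    z = (math.log(model) - math.log(obs)) / math.log(factor)
    return -0.5 * z * z


lai_obs = 0.3  # a sparse canopy, close to zero

arith_low, arith_high = arithmetic_bounds(lai_obs, 0.5)  # lower bound is negative
geo_low, geo_high = geometric_bounds(lai_obs, 1.5)       # both bounds stay positive
```

With the arithmetic error the lower bound is a physically impossible negative LAI, while the geometric bounds remain positive however small the observation becomes.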

To assist the process of searching dimensions 
of variability and to ensure that nonsensical solu- 
tions are avoided, CARDAMOM uses the concept of 
ecological and dynamic constraints (EDCs). EDCs
ensure that model simulations are realistic and eco- 
logically viable. EDCs reduce the uncertainty in the 
model parameters by rejecting estimations that do 
not satisfy various conditions applied to C alloca- 
tion and turnover rates as well as trajectories of C 
stocks. For example, EDCs ensure that turnover rates 
of wood are always slower than for fine roots and 
foliage. They ensure that root:shoot biomass ratios 
are within broad but realistic bounds. EDCs can be 
used to favor parameters that result in pools close 
to steady state values. Ensuring a close-to-steady- 
state system means that pools and fluxes are broadly 
consistent with hypothesized processes and climate 
forcing. This assumption is broadly valid for low 
resolution studies with coarse grid cells (e.g., 1 x
1° grids). At finer resolutions (e.g., 1 ha), increasing
heterogeneity will include aggrading and recently 
disturbed systems with accumulating biomass, an 
ongoing challenge for CARDAMOM. 

CARDAMOM is a codebase that integrates the 
DALEC model, the driving data sets, the observa- 
tional data sets and the Bayesian MC algorithm 
(Figure 27.2). CARDAMOM can be applied at a
single site, or at multiple sites, including pixel-by-pixel
across a gridded land surface, even up to
global scale. However, computational demands are 
high, so running CARDAMOM globally requires 
low resolution (1 x 1° pixels) at present. The out- 
puts of CARDAMOM are accepted parameter sets 
for each pixel/location, consistent with local forc- 
ing and observations. From these parameter sets 
the full C cycle at each location can be reproduced 
probabilistically, by using these and the driving 
data to run multiple instances of DALEC. Typically, 
500 randomly sampled parameter sets are stored 
from each of three chains in the MC process. From 
the resulting 1500 stored parameter sets a detailed 
distribution of uncertainty in processes, traits, car- 
bon fluxes and stores, and their covariances, can be 
determined at pixel scale. 


Innovations in the CARDAMOM Approach 


The outputs of CARDAMOM have clear differences 
from typical C cycle model simulations. CARDAMOM 
produces large ensembles of simulations allowing 
uncertainty to be characterized, whereas typical 
models produce only one simulation, and so lack 
uncertainty estimates. CARDAMOM avoids a strict 
imposition of steady state conditions which typi- 
cal models employ. The typical model is run under 
steady state drivers for hundreds of years or more, 
until all C pools reach a steady state. Then, from this 
steady state, adjustments to drivers are introduced,
such as climate change, and the adjustments to stocks
and fluxes are followed. In CARDAMOM there is no
spin-up. For each location or pixel, EDCs ensure 
that the C cycle is in quasi-steady state if there is no 
clear information from assimilated data on whether 
the system is a source or sink. The ensemble
of accepted parameter sets will then include some that
are sinks and some that are sources, spanning a
balanced net ecosystem exchange (NEE) of CO2.
The ensemble will therefore register uncertainty about
source/sink characteristics if this information is not
clear from assimilated data.

Typical ecosystem C models use the concept of 
plant functional types (PFT) to assign parameters. 
Thus, there are parameter sets for evergreen broad- 
leaf forest, for C, grasslands, tundra, etc. that are 
the same in all such locations globally. Competition 
among PFTs is simulated by the model in each pixel 
to determine the relative coverage. In CARDAMOM 
parameter sets are independent for each pixel, 
determined only from information available in that 


pixel, its forcing, and EDCs. Thus, parameters dif- 
fer spatially, adjusting along climate gradients and 
varying between continents depending on avail- 
able information. Where information is limited, 
then parameter ensembles will be less constrained 
and will more closely match the prior information 
on parameter bounds. The information content 
of data in each CARDAMOM pixel can be deter- 
mined by quantifying how much each parameter 
posterior is reduced in scale from the parameter 
prior. Likewise, comparing this prior-posterior 
metric among parameters provides insights into 
which processes have been constrained by avail- 
able data, and which have not. For instance, it is 
typical for parameters related to foliage (e.g., leaf
lifespan, canopy efficiency) to be more strongly 
constrained than those related to root turnover 
or litter mineralisation. This difference is because 
CARDAMOM usually has available time series of 
satellite LAI data which provide useful information 
on foliar processes directly. Information on roots 
or litter is sparse from earth observation. 
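One way to express the prior-posterior metric described above is as one minus the ratio of the posterior 90% interval width to the prior range. This particular formula, the function name, and the synthetic posteriors below are assumptions made for illustration; other measures of posterior spread would serve the same purpose.

```python
import statistics


def posterior_reduction(samples, prior_lower, prior_upper):
    """1 minus (posterior 90% interval width / prior range).

    Values near 1 mean the data strongly constrained the parameter;
    values near 0 mean the posterior is almost as wide as the prior.
    """
    cuts = statistics.quantiles(samples, n=20)  # cut points at 5% steps
    return 1.0 - (cuts[-1] - cuts[0]) / (prior_upper - prior_lower)


# Synthetic posteriors on a common prior range of 0 to 1 (illustrative only).
leaf_lifespan_post = [0.40 + 0.001 * i for i in range(101)]  # narrow spread
root_turnover_post = [0.05 + 0.009 * i for i in range(101)]  # wide spread

leaf_reduction = posterior_reduction(leaf_lifespan_post, 0.0, 1.0)
root_reduction = posterior_reduction(root_turnover_post, 0.0, 1.0)
```

Mapping such a metric pixel by pixel, or comparing it across parameters, shows where and for which processes the available observations carry information.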


An Example of CARDAMOM 


An example of CARDAMOM application at subcontinental
scale is revealing. Here we have completed
an analysis for Brazil during 2001-2017.
The input data include Copernicus LAI time series,
a biomass map, and a soil C map. Drivers include
meteorological data, a forest loss product, and a
burned area product (Figure 27.3). CARDAMOM
produces parameter estimates, whose median values
can be mapped to show the ecological spatial
variation across Brazil (Figure 27.4). It is clear
from inspection that there are emergent patterns
resulting from the pixel-by-pixel analysis for leaf
traits. One can see distinct differences in leaf N,
leaf C mass per area and leaf lifespan. The Amazon
basin has patterns clearly different from Cerrado
(bordering the Amazon to the SE), Caatinga (NE
Brazil), and Atlantic Forest (E coast of Brazil), for
example. Within the Amazon there is evidence
of some parameter differences perhaps related
to wetland areas. Boundaries between these key
biomes are clear in some areas and more blurred in others.

Figure 27.3. An example of the inputs used in generating a Brazilian model-data fusion (MDF) analysis of C cycling at 1°
resolution monthly over the period 2001-17. The left-hand panel shows the observational data used to constrain the C cycle,
including time series of leaf area index (LAI) from satellite observations, a map of soil C from interpolated field data, and a
biomass map created from multiple satellite data. The right-hand panel shows observations used to drive process rates, such as
vapor pressure deficit (an input to the GPP model in DALEC), and observations used to force land use change and fire impacts.
Earth observation (EO) products are used to determine deforestation losses of C, and the burned fraction for combustion losses.
The MDF algorithm runs very large ensembles in each grid cell using the drivers, and selects parameter sets that generate pools
of C consistent with the observations and their errors. The output is a selection of these accepted parameter sets for each pixel
in the analysis. Analysis provided by T.L. Smallman.

Figure 27.4. Retrieved median estimates of leaf traits for the Brazilian analysis using CARDAMOM. These traits are parameters
within DALEC. Leaf N determines the maximum rate of photosynthesis. Leaf C per area determines the relationship between
foliar C mass and leaf area index (LAI), and the investment requirement in C to construct leaf area. Leaf lifespan determines the
turnover time of foliage, and therefore the investment of C required each year to maintain the canopy at the observed LAI. For
each pixel a probability distribution for each parameter is determined. Analysis provided by T.L. Smallman.

These biome patterns are also clear in parameters
related to biomass stocks (Figure 27.5).
Fractional allocation of NPP to leaves, wood and
roots shows clear emergent trade-offs across Brazil.
In the Amazon, allocation to wood is the dominant
fraction, reflecting the productive and competitive
nature of tropical moist forests. Similar
patterns can be observed in the narrow band of
Atlantic Forest along the coast. In the dry forests of
the Cerrado and even drier Caatinga the dominant
allocation fraction is to roots, reflecting the
water-limited nature of these systems, and their
disturbance by fire. Within the Amazon there is evidence
of regional variability, with the northeastern area
having relatively higher allocation to wood and
lower allocation to leaves than other areas.

A key novelty of the CARDAMOM approach is 
that it generates emergent patterns in C process- 
ing, between and within biomes — for instance 
across the equatorial forest biome. Field stud- 
ies are now providing better spatially resolved, 
bottom-up information on leaf and plant traits 
to guide model calibration. The top-down results 
from CARDAMOM can be evaluated against these 
independent data to check for consistency across 
biomes and continents (Figure 27.5). 

The typical model evaluation approach is to test 
using data independent from the calibration data. 
For instance, CARDAMOM estimates of leaf mass per 
area across the globe, emergent from earth observa- 
tions, climate drivers and DALEC model structure, 
could be compared to interpolations of field obser- 
vations. Areas of agreement and mismatches could 
be highlighted. The magnitude of mismatches could
be compared using different versions of DALEC
(e.g., with alternate phenology schemes), different 
climate forcing, and different earth observations of 
LAI to explore their validity. But there is an alter- 
nate philosophy for model-data fusion approaches, 
which is to assimilate the field observations of leaf 
traits (i.e., not leave any for independent testing). We 
can evaluate the resulting analysis for consistency of 
model and data, and for its likelihood. The proba- 
bilistic approach used in CARDAMOM provides a 
clear quantification of model value. The information 
content of new data sets like a bottom-up traits map 
can be evaluated by comparing analyses with and 
without the new data. What is the impact of the
newly assimilated data on the likelihood of the analysis?
Does this impact vary in space or time?


Figure 27.5. Retrieved median estimates of plant traits for the Brazilian analysis using CARDAMOM. Three of the panels show
the fraction of net primary production (NPP) allocated to leaves, wood and fine roots (i.e., these fractions sum to 1). The final
panel shows the wood residence time, indicating the longevity of C storage in live woody biomass. The patterns are emergent,
as each pixel is analyzed independently to produce likely estimates of these parameters. Analysis provided by T.L. Smallman.


KEY CHALLENGES AND OPPORTUNITIES FOR
DATA ASSIMILATION


The earth's land surface is both dynamic and
highly heterogeneous, with variation in C stocks
occurring at varied length scales across the globe
according to biological processes and exogenous
factors linked to disturbance and management.
Disturbance and management operate on a range
of scales, but typically are more important at finer
spatial resolutions. Thus, as analyses sharpen their 
resolution from 1 x 1° to 1 ha, for instance, distur- 
bance and management become more important 
and ecosystem properties become more dynamic 
and potentially further from steady state. At 1 x 1° 
the annual rate of forest loss is relatively small 
compared to stocks, for typical forested landscapes. 
But in these same landscapes a 1 ha forest plot can 
lose most of its stocks in one year, due to fire or 
clearance, and rates of aggradation can also be fast. 
There is a need to resolve forest C processes at 
fine resolutions (e.g., 1 ha) to support monitoring,
reporting and verification processes linked to
global treaties for climate. For this reason, there are 
numerous satellite missions under construction or 
in orbit to provide data on forest structure at these 
resolutions. Frameworks such as CARDAMOM 
can assimilate these satellite observations to pro- 
duce C flux estimates at similar resolutions. There 
are technical challenges relating to computing 
power requirements and data management. There 
are also algorithmic challenges to calibrate mod- 
els for dynamic forest patches where disturbance 
and recovery from disturbance are common states. 
High temporal resolution data from satellites pro- 
vide a means to monitor for rates of change and 
thereby to calibrate internal processing of C. Thus, 
repeated biomass stock estimates have been shown 
to provide insights into carbon allocation param- 
eters, and even into key soil parameters, as these 
are downstream of wood dynamics. Further, the 
spatial distribution of biomass stocks also provides 
information on the local landscape dynamics. A 
steady state landscape will have a normal distribu- 
tion of biomass stocks. A degrading landscape will 
be dominated by lower biomass with a long tail of 
higher biomass from remnant forest patches. Using 
this information will require new algorithms not 
currently used in CARDAMOM, to link information 
from neighboring pixels in a landscape analysis. 
The atmospheric C community has a long history
of using inversion approaches to identify
source and sink regions by linking atmospheric
transport models and atmospheric CO2 concentration
data in carbon cycle data assimilation systems
(CCDAS). There is a clear opportunity to link the
ecological data assimilation in frameworks such as 
CARDAMOM to atmospheric approaches. In fact, 
links to atmospheric data would be valuable particularly
because identifying whether a landscape is a
source or sink is challenging. Future developments


include the potential to link to atmospheric inver- 
sions for independent tests of net exchanges. 


SUGGESTED READING 


For more details on the CARDAMOM method read: 

Bloom, A.A. & M. Williams (2015). Constraining eco- 
system carbon dynamics in a data-limited world: 
integrating ecological “common sense” in a model- 
data-fusion framework. Biogeosciences 12: 1299-1315. 

Bloom, A.A., J.-F. Exbrayat, I. R. van der Velde, L. Feng,
M. Williams (2016) The decadal state of the terrestrial
carbon cycle: global retrievals of terrestrial carbon
allocation, pools and residence times. Proceedings
of the National Academy of Sciences 113: 1285-1290.

Smallman, T.L., J.-F. Exbrayat, M. Mencuccini, A. A.
Bloom and M. Williams (2017) Assimilation
of repeated woody biomass observations constrains
decadal ecosystem carbon cycle uncertainty
in aggrading forests. Journal of Geophysical Research
Biogeosciences 122: 528-545.


Compare the CARDAMOM approach with other model- 
data fusion approaches in the papers presented below. 
What are the advantages of these other approaches? 

Peylin, P., Bacour, C., MacBean, N., Leonard, S., Rayner,
P., Kuppel, S., Koffi, E., Kane, A., Maignan, F.,
Chevallier, F., Ciais, P., and Prunet, P. (2016). A new
stepwise carbon cycle data assimilation system using
multiple data streams to constrain the simulated
land surface carbon cycle. Geoscientific Model Development
9: 3321-3346.

Kaminski, T., Scholze, M., Vossbeck, M., Knorr,
W., Buchwitz, M., and Reuter, M. (2017).
Constraining a terrestrial biosphere model with
remotely sensed atmospheric carbon dioxide. Remote
Sensing of Environment 203: 109-124.


QUIZZES 


1. Models aim to be general, realistic, and accurate. 
Why is it so hard to meet all three goals at once? 


2. What are the arguments for and against calibrat- 
ing and validating global C cycle models at eddy 
covariance sites alone? 


3. How does the CARDAMOM approach to gener- 
ating C model parameters differ from the typi- 
cal plant functional type approach? What are the 
relative advantages of each method? 

4. What are the challenges to applying
CARDAMOM at very high resolutions (e.g., 1
ha) across the globe? How might these challenges
be addressed?


CONTENTS


Practice Design


This practice uses the SPRUCE field experiment as a 
case study to explore how a model's parameter values 
influence its output via sensitivity analysis, and how 
parameters may be estimated via assimilation of data 
from field measurements. Exercise 1 addresses how 
certain model output variables are sensitive to changes 
in certain parameter values. Exercise 2 illustrates how 
observational data change posterior parameter distri- 
butions in comparison to the priors. In this practice, 
we use data from the SPRUCE research site in northern
Minnesota, USA, as a context to show possible studies
you could conduct by using data assimilation. The
exercises may be performed using the CarboTrain 
software. Instructions for installing and working with 
CarboTrain are available in Appendix 3. 


PRACTICE DESIGN 


Data sets that are used in this practice are listed in 
Table 28.1. 


TABLE 28.1
The SPRUCE site data used in this practice


Purpose                          Data name                               Year       Period                 Time step
Environmental variables (input)  Soil temperature at 0, 5, 10, 20, 30,   2011-2018  Whole year             Hourly
to drive the TECO model,         40, 50, 100, 200 cm depth
spin-up and forward run          Air temperature at 2 m
                                 Relative humidity at 2 m
                                 Wind speed at 10 m
                                 Precipitation
                                 Photosynthetically active radiation
                                 (PAR) at 2 m
Water balance calibration        Soil moisture at 0, 20 cm               2014-2018  Whole year             Hourly
                                 Water table depth                       2014-2018  Whole year             Hourly
Data streams used in             Leaf, wood, root biomass                2014-2018  End of growing season  Once a year
data-model fusion                Soil C content                          2012       August 13-15           One time
                                 NEE, GPP, ER fluxes                     2015-2018  Growing season         1-2 times a month

NOTE: NEE = Net Ecosystem Exchange, GPP = Gross Primary Production, ER = Ecosystem Respiration (also written Reco in this chapter).
DOI: 10.1201/9780429155659-35


EXERCISE 1: Parameter sensitivity of model output


In this exercise we will examine how the 
parameter values of a model affect its outputs. 
Launch CarboTrain on your computer (see 
Appendix 3) and follow these steps: 


a) Select Unit 7 > Exercise 1 > Set output
directory (e.g., mydir/EX1/default) > Click
Run Exercise.

b) Output files appear in the subdirectory
simulation of the output directory you
specified in the previous step. CarboTrain
plots daily cpool.png and daily fluxes.png,
with the data, in the same folder for a quick
look. The full model outputs are also provided
as .csv files. The model output can be found in
the file ‘your_output_dir/simulation/Simu_dailyflux14001.csv’.
The observation data can be found in
‘/CarboTrain/Source_code/TECO_2.3/input/SPRUCE_cflux.txt’
and ‘/CarboTrain/Source_code/TECO_2.3/input/SPRUCE_cpool.txt’.
Annotations for the output data columns are here:
your_CarboTrain_dir/CarboTrain/Source_code/TECO_2.3/annotations in TECO modeloutput_file_format.jpg

c) Select Unit 7 > Exercise 1 > Set output
directory (e.g., mydir/EX1/increase vcmax).
Click Set Initial parameters and increase the
value of the parameter “Vcmax” by 30%. Click
Run Exercise.

d) Repeat step (c) with a new output directory
(e.g., mydir/EX1/decrease vcmax). Click Set
Initial parameters and decrease the value of
the parameter “Vcmax” by 30%.

e) Repeat steps (c) and (d) for increases
and decreases of the following parameters.
Remember to give the output directory a new
name each time you change a parameter value.
You should end up with ten different output files:

i. SLA, the specific leaf area;

ii. Tau_leaf, leaf carbon turnover time;

iii. Tau_root, root carbon turnover time;

iv. Tau_slowSOM, recalcitrant soil C pool
turnover time.

f) Plot the output data from your ten model
simulations and compare the differences.
For ease of comparison, it helps to plot the
results for the ±30% changes in one parameter
(e.g., Vcmax) on the same figure.
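The comparison in step (f) can also be summarized with a single number per parameter. The sketch below is only an illustration and is not part of CarboTrain: the function name `sensitivity_index` and all GPP values are hypothetical. It computes the relative change in an output per unit relative change in a parameter, using the pair of runs in which the parameter was decreased and increased by 30%.

```python
def sensitivity_index(output_decreased, output_increased, output_default, change=0.30):
    """Relative change in an output per unit relative change in a parameter.

    Central difference across the two runs in which the parameter was
    decreased and increased by the same fraction `change` (0.30 for +/- 30%).
    """
    return (output_increased - output_decreased) / (2.0 * change * output_default)


# Hypothetical mean annual GPP (g C m^-2 yr^-1) from three runs per parameter:
# parameter decreased by 30%, default, parameter increased by 30%.
runs = {
    "Vcmax": (610.0, 750.0, 860.0),
    "Tau_slowSOM": (748.0, 750.0, 751.0),
}
for parameter, (low, default, high) in runs.items():
    index = sensitivity_index(low, high, default)
    print(f"{parameter}: sensitivity index of GPP = {index:.2f}")
```

An index near 1 means the output changes roughly in proportion to the parameter; an index near 0 means the output barely responds. The same function can be applied to any output column, such as a carbon pool size.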


After Step (b), you can inspect the default run
results, such as daily leaf, wood, and root pool
changes, GPP, NEE, Reco, and other carbon
fluxes from 2011 to 2016. When you decrease
and increase the Vcmax value by 30% in Steps
(c) and (d), you will see substantial changes in
GPP and autotrophic respiration. Similarly,
when the value of specific leaf area is changed,
we can see effects on GPP and autotrophic
respiration. However, in both cases, we don't see
much change in the carbon pools. From these
comparisons, you may realize that Vcmax and
SLA have similar effects on model outputs.

By contrast, when you change the turnover
time of the leaf pool, you may see that the
steady-state size of the leaf pool changes
considerably, whereas GPP and NEE do not change
much. At steady state, the leaf pool is about 350
g C m⁻² when we decrease the turnover time
by 30%, increasing to about 450 g C m⁻² if we
increase the turnover time by 30%. Similarly,
tuning the turnover time of root C changes the
steady-state size of the root carbon pool. Thus,
we learn that turnover time affects the
steady-state size of carbon pools.
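The link between turnover time and steady-state pool size can be illustrated with a one-pool sketch. The code below is not TECO. It assumes a single pool with a constant carbon input and first-order loss, dC/dt = input - C/tau, for which the steady state is C* = input x tau. The function names and all numbers are hypothetical and chosen only for illustration.

```python
def simulate_pool(carbon_input, tau, c0=0.0, years=200, dt=0.01):
    """Integrate dC/dt = carbon_input - C / tau with forward Euler steps."""
    c = c0
    for _ in range(int(years / dt)):
        c += dt * (carbon_input - c / tau)
    return c


def steady_state(carbon_input, tau):
    """Analytical steady state of the one-pool model: C* = input * tau."""
    return carbon_input * tau


# Hypothetical values: input in g C m^-2 yr^-1, turnover time in years.
carbon_input = 200.0
tau_default = 2.0

for factor in (0.7, 1.0, 1.3):
    tau = tau_default * factor
    print(f"tau x {factor:.1f}: simulated {simulate_pool(carbon_input, tau):.1f}, "
          f"analytical {steady_state(carbon_input, tau):.1f} g C m^-2")
```

In this linear sketch the steady-state pool scales in exact proportion to the turnover time. A full ecosystem model need not respond proportionally, as the TECO leaf pool values quoted above show.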

Exercise 1 shows us that certain output variables
are sensitive to changes in certain parameter
values. We have also seen that adjustments in
different combinations of parameter values can
sometimes generate the same changes in output
variables. In other words, we might sometimes
get the right answers but for the wrong reasons.
An example is that the values of leaf C turnover
time, SLA, and Vcmax can all be responsible for
the magnitude and pattern of leaf carbon pool
sizes and GPP. This is so-called equifinality, an
issue for both manual tuning and Bayesian-guided
parameter estimation (Luo et al. 2009).
However, manually tuning parameter values
can be tricky and subjective. Data assimilation
instead searches the whole prior ranges of
parameter values, over as many as millions of
iterations, to generate posterior probability
distributions.


QUESTION:


Which output variables are most sensitive to
change in each of the five parameters? Can you
explain why?


EXERCISE 2: Data assimilation with TECO


This exercise will help you learn to perform
data assimilation with the comprehensive
ecosystem model, TECO. We will use CarboTrain as
a toolbox for the exercise.


1) Launch CarboTrain and select Unit 7 >
Exercise 2 > Set output directory (e.g.,
mydir/EX2/default) > Click Run Exercise.
The model runs iteratively 20,000 times
with different parameter values, and the
accepted parameter values are saved in the
file DA/paraest.txt. It takes several
hours to finish the full data assimilation
process. If you want fast but less accurate
results, you may let it run for only 500
iterations, reducing the run time to 5-10
minutes. To do this, monitor the value of
‘jsimu’ in your terminal window; when it
reaches 500, terminate the run by pressing
Ctrl+C. CarboTrain will then randomly
select 100 sets of accepted parameters
saved in the file DA/paraest.txt
and save the model forward run output in the
files forecasting/Simu_dailyflux***.txt,
where *** is a 3-digit number between 1 and 100.
A quick plot of the results is also provided
in the file pool and flux.png.


2) This exercise is a teaser for you to walk
through the whole process of data
assimilation. In a formal data assimilation
study, tens of thousands to millions of
iterations are normally needed to get to a
stabilized chain. The number of iterations
needed depends on the model structure,
the parameter sampling and selection
algorithms (e.g., different variants of the
Markov chain Monte Carlo algorithm;
see Chapter 22), and the quality and
quantity of observations.
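To make the idea of a sampling chain concrete, here is a minimal random-walk Metropolis sketch. It is not the algorithm implemented in TECO or CarboTrain. It assumes a hypothetical one-pool model whose steady-state size is input x tau, a uniform prior on the turnover time tau, Gaussian observation errors, and invented observations. All names and numbers are illustrative.

```python
import math
import random


def log_likelihood(tau, observations, carbon_input, sigma):
    """Gaussian log-likelihood of observed pool sizes given the prediction input * tau."""
    predicted = carbon_input * tau
    return -0.5 * sum(((obs - predicted) / sigma) ** 2 for obs in observations)


def metropolis(observations, carbon_input, sigma, prior=(0.5, 5.0),
               n_iter=5000, step=0.2, seed=1):
    """Random-walk Metropolis sampler for tau with a uniform prior."""
    rng = random.Random(seed)
    low, high = prior
    current = rng.uniform(low, high)
    current_ll = log_likelihood(current, observations, carbon_input, sigma)
    chain, n_accepted = [], 0
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        if low <= proposal <= high:  # proposals outside the prior range are rejected
            proposal_ll = log_likelihood(proposal, observations, carbon_input, sigma)
            # Accept with probability min(1, likelihood ratio).
            accept_probability = math.exp(min(0.0, proposal_ll - current_ll))
            if rng.random() < accept_probability:
                current, current_ll = proposal, proposal_ll
                n_accepted += 1
        chain.append(current)  # the current value is recorded at every iteration
    return chain, n_accepted


# Hypothetical observations of a pool (g C m^-2) generated around tau = 2.0 yr.
obs = [395.0, 410.0, 402.0, 388.0, 405.0]
chain, n_accepted = metropolis(obs, carbon_input=200.0, sigma=20.0)
kept = chain[len(chain) // 2:]  # discard the first half as burn-in
posterior_mean = sum(kept) / len(kept)
print(f"accepted {n_accepted} of {len(chain)} proposals, "
      f"posterior mean tau = {posterior_mean:.2f} yr")
```

The values in `kept` approximate the posterior distribution of tau. With a short chain or a poorly chosen step size the chain may not have stabilized, which is why formal studies run many more iterations.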


Figure 28.1 shows parameter posterior
distributions from a full data assimilation run
informed by observational data from the
SPRUCE field experiment. We can see that SLA
and leaf C turnover time are well constrained
by the observational data. In Exercise 2, we
see that parameters to which the model is
sensitive are more likely to be constrained by
data. Instead of one value, the data assimilation
study generates probability distributions
of parameter values, allowing the range
of uncertainty to be displayed and analyzed.
With data assimilation, you can quantify
uncertainties and partition the uncertainties
into different sources of error and process
variabilities.

Figure 28.1. Parameter posterior distribution from data assimilation based on SPRUCE experimental data using the TECO model.

QUESTIONS:


1. What is a constrained posterior distribution?


2. Which parameters are most likely to be
constrained by the observation data sets, and why?
(hint: think of equifinality and parameter
sensitivity)


SHUANG MA


Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 


UNIT EIGHT 


Value of Data to 
Constrain Models and 
Their Predictions 




CHAPTER TWENTY-NINE 


Information Contents of Different Types of Data Sets to Constrain 
Parameters and Predictions 


Enqing Hou
Northern Arizona University, Flagstaff, USA 


CONTENTS 


Introduction


An Overview of the Information Contents of Model and Data

A Method to Quantify the Information Contents of Model and Data

Short- and Long-term Information Contents of Model and Data

The Information Contents of Data Depend on the Amount and Type of Data


Model Equifinality


Prediction of Land Carbon Dynamics After Data Assimilation


Summary
Suggested Reading
Quizzes


There is an urgent need to reduce model uncer- 
tainty in predicting land carbon dynamics. The 
model uncertainty can potentially be reduced by 
assimilating the increasing amount of observa- 
tional and experimental data into the model. This 
chapter first gives an overview of the information 
contents of models and data and then introduces 
a method to quantify the information contents 
of data and models. Next, the chapter introduces 
the use of data assimilation to extract informa- 
tion from the data and models, and shows how 
the information contained within a model and in 
data can vary depending on the timescale of focus: 
short- versus long-term. The extracted information 
is useful to constrain model parameters and pre- 
dictions, as well as to guide future data collection. 


INTRODUCTION 


Model uncertainty in predicting future land carbon 
balance is huge. Reducing this uncertainty is a major 
challenge in land carbon cycle studies, yet is urgently 
needed in order to support the development of
strategies to mitigate climate change. Meanwhile,
the availability of relevant observational and 
experimental data is increasing, driven by research 
advances and societal needs for data to support the 
management of natural resources in the context 
of global change (Luo et al. 2011). To meet these 
societal needs, we must address critical questions 
such as how best to use data to constrain models 
and how to select appropriate datasets to optimize 
information collection. These questions can poten- 
tially be addressed with the aid of data assimilation 
or data-model fusion techniques, which use data to 
constrain initial conditions and parameters of mod- 
els, thereby constraining their predictions of the 
land carbon cycle (Luo et al. 2011). 

In this chapter, we first give an overview of the 
information contents of models and data. Here, 
the information of a model means the structure 
and parameters of a model, which reflect knowl- 
edge of processes of the system the model strives 
to represent. We then introduce a method to quan- 
tify information content, both for models and 
for data. Next, we discuss how the information
contained within a model and in data, pertinent to
the prediction of land carbon dynamics, can vary
depending on the timescale of focus: short- versus
long-term. After that, we relate the information
contents of land carbon datasets to model param- 
eters, and define the issue of equifinality of model 
parameters. In the final part of the chapter, we 
address the value of data for constraining model 
predictions on land carbon dynamics. 


AN OVERVIEW OF THE INFORMATION 
CONTENTS OF MODEL AND DATA 


A well-known saying in the modeling community 
is “all models are wrong, but some are useful” (Box, 
1979). That means we cannot construct a “perfect” 
model, but we can make reasonable and useful predictions.
If a model can predict land carbon dynamics
reasonably well during hindcast, based on an
evaluation of the model output relative to available
observations, we assume it may inform future land
carbon dynamics with high confidence. Whether
a model can predict the carbon dynamics well dur- 
ing hindcast depends on both model structure 
and parameter values, which constitute the prior 
knowledge of a system (Weng and Luo 2011). 

The structure of a model (the equations and their 
linkages to forcing, state variables, and parameters) 
contains information relevant to its behavior and 
performance as a predictive tool. A model expresses 
the logical consequences of a set of hypotheses, for- 
malized through the model structure, and generates 
predictions that can be compared with observa- 
tions in a quest to falsify these hypotheses (Canham 
2003). In other words, the model is an integra- 
tion of theory and knowledge, which can be used 
to test hypotheses and make predictions (see also 
Chapter 2). For example, ecosystem respiration is 
usually modeled as an exponential or a unimodal 
function of temperature in land carbon mod- 
els, which are built upon empirical relationships 
between ecosystem respiration and temperature 
reported in the literature. These alternative formula- 
tions predict ecosystem respiration similarly at low 
temperatures but differently at high temperatures; 
either way, it is information on the prediction of 
ecosystem respiration (no matter whether correct 
or not). The two response functions represent com- 
peting hypotheses that can be tested and falsified by 
comparing their predictions to measurements. 
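The contrast between the two response functions can be sketched numerically. The functional forms and parameter values below are assumptions chosen for illustration and are not taken from any particular land carbon model: the exponential form is a standard Q10 function, and the unimodal form is the same Q10 function multiplied by a logistic high-temperature term.

```python
import math


def resp_exponential(temp_c, r_ref=1.0, q10=2.0, t_ref=10.0):
    """Exponential (Q10) response: respiration keeps rising with temperature."""
    return r_ref * q10 ** ((temp_c - t_ref) / 10.0)


def resp_unimodal(temp_c, r_ref=1.0, q10=2.0, t_ref=10.0, t_high=35.0, k=0.3):
    """Q10 response damped by a logistic high-temperature term, giving a single peak."""
    inhibition = 1.0 / (1.0 + math.exp(k * (temp_c - t_high)))
    return resp_exponential(temp_c, r_ref, q10, t_ref) * inhibition


for temp_c in (0, 10, 20, 30, 40):
    print(f"{temp_c:>3} degC  exponential {resp_exponential(temp_c):5.2f}  "
          f"unimodal {resp_unimodal(temp_c):5.2f}")
```

With these hypothetical settings the two curves are almost indistinguishable below about 20 degC and diverge strongly above 30 degC, so only measurements at high temperatures could discriminate between the two hypotheses.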

Model parameter values also contain information
relevant to the predictions made. Once we
have a model, whether the model can predict well
or not will depend on parameter values. Model
parameters can be tuned rigorously and automatically
with data assimilation. Prior values or ranges
(or uncertainty) of model parameters may be set
based on empirical studies of processes or expert
judgment informed by empirical measurements,
which themselves are a kind of information on
prediction. Assimilating relevant datasets, such
as land carbon pool and/or flux measurements,
into a model can further constrain initial conditions
and parameter values of a land carbon model
in a rigorous way, resulting in a data-informed
process-oriented modeling approach. This allows
researchers to make more realistic predictions of
land carbon dynamics.

A good example to show how the information 
contents of model and data constrain predictions 
of land carbon dynamics is the study of Weng and 
Luo (2011). They compared model predictions on 
carbon dynamics in a temperate forest using the 
Terrestrial ECOsystem (TECO) model with and 
without data assimilation of eight datasets on car- 
bon pools or fluxes. They found that probability 
density functions (PDFs) of the carbon pool sizes 
in initial predictions by the model were bell-shaped 
for slow-cycling carbon pools including woody 
biomass, structural litter, slow-cycling soil organic 
matter (SOM), and passive SOM, but skewed to 
their low carbon content ends for fast-cycling 
carbon pools including foliage, fine roots, meta- 
bolic litter and fast-cycling SOM (Figure 29.1). 
Departure of the PDFs from a uniform distribu- 
tion (null knowledge) suggests that model struc- 
ture, together with the prior ranges of parameters, 
contains information on land carbon pools, partic- 
ularly on slow-cycling carbon pools. After assimi- 
lating the eight datasets into the TECO model, they 
found that simulated sizes of all eight carbon pools 
except metabolic litter pool were bell-shaped and 
well constrained (Figure 29.1). These results sug- 
gest that data can provide a substantial amount of 
additional information on land carbon dynamics. 


A METHOD TO QUANTIFY THE INFORMATION 
CONTENTS OF MODEL AND DATA 


Figure 29.1. Simulated carbon content with parameters sampled in prior distributions (model only) and posterior
distributions (model + data), respectively. Struct. indicates structural, metab. indicates metabolic, and SOM indicates
soil organic matter. Derived from Weng and Luo (2011).

Weng and Luo (2011) introduced the use of the
Shannon information index (Shannon 1948)
to measure the relative information of land carbon
models and datasets. In information theory,
entropy (H) is used to measure the “information”,
“surprise”, or “uncertainty” of a random variable
(X); a higher entropy indicates more information,
surprise, or uncertainty. In carbon modeling,
entropy can be used to measure the uncertainty of
an output variable (e.g., predicted leaf carbon pool)
of a model; a higher entropy indicates a larger
uncertainty of the variable. Entropy is maximum
when there is null knowledge about an output
variable, i.e., a uniform distribution of values. The
information content of a model can be calculated
as the difference between the entropy of a uniform
distribution (null knowledge) and the entropy of the
posterior PDF of an output variable, which carries
the trace of model information contained in its
structure and parameters. The information content
of a dataset can be calculated as the difference in the
entropy of the posterior PDF of an output variable
between the model alone and the model when informed
by a dataset, for instance through parameter calibration
to fit the model to the dataset.

Specifically, the entropy H of a discrete random
variable X in {x_1, ..., x_n} is calculated as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i) \tag{29.1}$$

where $p(x_i)$ is the probability of event $x_i$, and n is the
number of events. The negative sign ensures that
the result is always positive or zero, because $p(x_i)$
is smaller than or equal to 1.0. If the base b equals
2, the unit is the bit. Equation (29.1) can be applied
where X is an output variable from a model, or a
variable in a dataset, to characterize the information
it contains. The set of X may comprise, for example,
n points in a time series, or n cells of a spatial grid.

The information (I) of a model and/or dataset(s)
is expressed in a relative term, that is, relative to
information without either a model or any dataset.

Null knowledge on dynamics of a carbon pool
is defined by a uniform distribution of the pool
size within its range. The entropy of null knowledge,
$H_0$, is calculated as:

$$H_0 = \log_2 n \tag{29.2}$$

Relative information of null knowledge ($I_0$)
would be 0.
The entropy of the PDF of a carbon pool, when
simulated by a model alone, $H(X_m)$, is calculated as:

$$H(X_m) = -\sum_{i=1}^{n} p(x_{mi}) \log_2 p(x_{mi}) \tag{29.3}$$

where $X_m$ is a carbon pool simulated by the model
alone, $x_{mi}$ is the mean value of $X_m$ in bin i, and n is
the number of bins with equal widths in the range
of the simulated carbon pool. Relative information
of the model ($I_m$) is calculated as:

$$I_m = H_0 - H(X_m) \tag{29.4}$$


Similarly, the entropy of the PDF of a carbon
pool simulated by a model informed by dataset(s),
$H(X_{md})$, is calculated as:

$$H(X_{md}) = -\sum_{i=1}^{n} p(x_{mdi}) \log_2 p(x_{mdi}) \tag{29.5}$$

where $X_{md}$ is a carbon pool simulated by the model
with the dataset(s), $x_{mdi}$ is the mean value of $X_{md}$ in
bin i, and n is the number of bins with equal widths
in the range of the simulated C pool. Relative
information of the dataset(s), $I_d$, is calculated as:

$$I_d = H(X_m) - H(X_{md}) \tag{29.6}$$
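The calculation in Equations (29.1) to (29.6) can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the code of Weng and Luo (2011): the pool range, the number of bins, and the two sets of simulated pool sizes (drawn here from Gaussian distributions) are all hypothetical.

```python
import math
import random


def binned_entropy(samples, low, high, n_bins):
    """Shannon entropy (bits) of samples binned into n_bins equal-width bins on [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in samples:
        index = int((value - low) / width)
        index = max(0, min(index, n_bins - 1))  # keep boundary values inside the bins
        counts[index] += 1
    total = len(samples)
    # Empty bins contribute nothing, since p * log2(p) tends to 0 as p tends to 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


rng = random.Random(0)
n_bins, low, high = 16, 0.0, 1600.0  # hypothetical pool range in g C m^-2

# Hypothetical pool-size samples: broad for the model alone, narrow after assimilation.
model_only = [min(max(rng.gauss(800.0, 250.0), low), high) for _ in range(20000)]
model_data = [min(max(rng.gauss(800.0, 80.0), low), high) for _ in range(20000)]

h0 = math.log2(n_bins)                                # Eq. 29.2, null knowledge
h_m = binned_entropy(model_only, low, high, n_bins)   # Eq. 29.3, model alone
h_md = binned_entropy(model_data, low, high, n_bins)  # Eq. 29.5, model with data
i_m = h0 - h_m                                        # relative information of the model
i_d = h_m - h_md                                      # Eq. 29.6, relative information of data
print(f"H0 = {h0:.2f} bits, I_m = {i_m:.2f} bits, I_d = {i_d:.2f} bits")
```

Note that the entropies depend on the chosen range and number of bins, so relative information values are only comparable when the same binning is used for the model-only and model-with-data distributions.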
The above method was developed by Weng and
Luo (2011) to quantify model and data information
on a carbon pool based on the PDF of simulated
C pool sizes. This method has also been used by
Keenan et al. (2013) to quantify model-data mismatch
in a carbon variable based on the PDF of
mismatches between modeled values and observed
values. Moreover, the method may be used to
quantify model and data information on a model
parameter based on the posterior PDF of model
parameter values, as introduced in Chapter 32.


SHORT- AND LONG-TERM INFORMATION 
CONTENTS OF MODEL AND DATA 


Projecting changes in the land carbon cycle over 
decades to a century is crucial for developing strat- 
egies to mitigate climate change. However, most 
data assimilation studies have focused on land 
carbon cycling in the short term (i.e., within a 
decade). Weng and Luo (2011) explored whether 
and how the information contents of the TECO 
model and data, in terms of their relative contribu- 
tions to predicted land carbon dynamics, change 
with projection time. They found that when pre- 
dicting slow-cycling carbon pools, for example, 
woody and soil C pools, the importance of the 
model increased with time while the importance 
of data decreased with time (Figure 29.2). In con- 
trast, when predicting fast-cycling carbon pools, 
for example, foliage and fine roots, the relative 
importance of the model and data did not change 
much with time (Figure 29.2). Moreover, the eight
datasets examined by the authors contained more
information on the simulated carbon pools of foliage
and fine roots than the model (Figure 29.2).
The model, however, contained more information
on the simulated litter carbon pool, fast-cycling
SOM, and passive SOM than the datasets (Figure
29.2). These results suggest that both data and
model are important in simulating land carbon
cycling; data are particularly important in constraining
predictions in the short term, whereas the model is
increasingly important in the long term.


THE INFORMATION CONTENTS OF DATA 
DEPEND ON THE AMOUNT AND TYPE OF DATA 


Figure 29.2. Simulated C contents and the relative information of model and data on simulated C contents over 100 years. Box
plots show carbon content distributions in the 5% (bottom bar), 25% (bottom hinge of the box), 50% (line across the box),
75% (upper hinge of the box), and 95% (upper bar) intervals. Derived from Weng and Luo (2011).

Since data can inform models, we may expect an
increase in information with the increase in data
amount. Indeed, the information content of data
generally increases with the amount of data. For
a specific type of data (e.g., the measurement of
net ecosystem exchange), information content
may initially increase but later saturate as additional
data are added. Not only the amount but
also the type of data is important to constrain
models. For example, Richardson et al. (2010)
assimilated six datasets on carbon dynamics in a
spruce-dominated forest into a simple forest carbon
cycle model, DALEC. They found that the joint
use of woody biomass increments and flux
tower-based measurements of net ecosystem exchange
constrained parameter values of DALEC markedly
better than the use of net ecosystem exchange
measurements alone. The degree to which parameter
uncertainty was reduced depended on both
the dataset used and its relationship to a given
parameter. For example, either soil respiration or
plant biomass increment data were needed to constrain
estimates of the fraction of gross primary
production respired. Litterfall data were required
to narrow estimates of turnover rate of foliage. Soil
respiration data were necessary to constrain the
turnover rate of soil organic matter.
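The diminishing return from adding more data of a single type can be illustrated with a textbook case. The sketch below is not the analysis used in the studies cited above. It assumes one parameter with a Gaussian prior that is observed directly n times with independent Gaussian errors, so that the posterior variance and the entropy reduction have closed forms. The function name and the standard deviations are hypothetical.

```python
import math


def information_gain_bits(n_obs, prior_sd, obs_sd):
    """Entropy reduction (bits) for a Gaussian parameter after n_obs independent observations."""
    posterior_var = 1.0 / (1.0 / prior_sd**2 + n_obs / obs_sd**2)
    return 0.5 * math.log2(prior_sd**2 / posterior_var)


previous = 0.0
for n_obs in (1, 2, 5, 10, 50, 100, 500):
    gain = information_gain_bits(n_obs, prior_sd=1.0, obs_sd=2.0)
    print(f"n = {n_obs:>3}: total gain {gain:5.2f} bits, "
          f"added since previous row {gain - previous:5.2f}")
    previous = gain
```

Under these assumptions each additional observation adds less information than the one before it. A different type of measurement, informing a different parameter, would start again from its own prior uncertainty, which is one way to see why the type of data matters as well as the amount.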

Data streams usually overlap in the information
they contain, and data collection is limited
by funding and logistical factors. Therefore, it is
critical to know which data streams are most useful
for constraining models. To address this, Keenan
et al. (2013) constrained a simple forest carbon
cycle model with 17 different datasets from the
Harvard Forest in Massachusetts, USA. They
iteratively ranked each dataset according to its
ability to reduce model uncertainty.
They found that some types of data were more
important than others in constraining the model
(Figure 29.3). For example, the measurement of
net ecosystem exchange was most useful to constrain
the model, followed by soil carbon turnover,
soil respiration, and litterfall (Figure 29.3).
A combination of measurements of fast and slow
carbon flows was necessary to achieve optimal
model performance. Models generally become
better constrained as more types of measurements
are added, but some measurements are relatively
redundant in the presence of others. This is the
case where measurements overlap in terms of the
information they encode relating a desired output
of the model to its parameters and/or structure.
Keenan et al. (2013) showed that when all 17
datasets were used, 26 of the 40 model parameters
could be identified or constrained. Parameter
uncertainty was reduced by 60% on average, but
most of this reduction was achieved with the use
of relatively few data streams (Figure 29.3). For
example, when the six most important data streams
were used, 14 parameters were constrained. Adding
the remaining 11 data streams constrained only 12
more parameters. Fourteen parameters were not
constrained even when all the available types of
measurements were used. The soil carbon pool
did not provide additional information to constrain
the model in the presence of soil respiration
and soil carbon turnover rate. Litter turnover did
not add information in the presence of litterfall.
Results like these can guide future data collection
and optimize the use of funding and the deployment
of labor.

Figure 29.3. The posterior parameter distributions for the best data combination at each stage in a hierarchical optimization
process. Variables on the left-hand side are the variables used to constrain the model at each stage (i.e., indicated by each row).
Numbers on the right-hand side are the number of parameters well constrained at each stage. Parameters are deemed to be well
constrained if their posterior distribution occupies at most half the range of the prior distribution. Solid gray circles represent
the optimum parameter value; black areas represent mirrored posterior probability distributions for each parameter. Derived
from Keenan et al. (2013).
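The criterion in the caption of Figure 29.3, that a parameter is well constrained if its posterior occupies at most half of the prior range, can be turned into a simple check. The sketch below is an assumption-laden illustration and not the code of Keenan et al. (2013): it measures the posterior spread by the central 95% interval of the samples, which is one possible reading of "occupies", and the parameter names and samples are hypothetical.

```python
import random


def central_interval(samples, coverage=0.95):
    """Return the central interval of the samples (simple sorted-index percentiles)."""
    ordered = sorted(samples)
    tail = (1.0 - coverage) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper


def is_well_constrained(samples, prior_low, prior_high, max_fraction=0.5):
    """True if the posterior interval spans at most max_fraction of the prior range."""
    lower, upper = central_interval(samples)
    return (upper - lower) <= max_fraction * (prior_high - prior_low)


rng = random.Random(42)
# Hypothetical posterior samples for two parameters that share the prior range [0, 1].
posteriors = {
    "tau_leaf": [min(max(rng.gauss(0.4, 0.05), 0.0), 1.0) for _ in range(5000)],
    "tau_passive_som": [rng.uniform(0.0, 1.0) for _ in range(5000)],
}
for name, samples in posteriors.items():
    print(name, "well constrained:", is_well_constrained(samples, 0.0, 1.0))
```

Counting the parameters for which the check returns true, after each data stream is added, gives a tally comparable in spirit to the numbers on the right-hand side of Figure 29.3.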


MODEL EQUIFINALITY 


Model equifinality is defined as the situation
where different sets of model parameter values
can yield similar model predictions (Luo et al.
2009). Assimilating data into a model can reduce
equifinality, but it is hard to eliminate completely.
Equifinality of parameter sets may be indicated by
bivariate correlation between posterior parameter
values. However, lack of such bivariate correlation
does not necessarily indicate a well-constrained
model. Instead, whether a model is well constrained
or not may be better indicated by bivariate
covariance between posterior parameter values (Figure
29.4). This is because covariance scales correlation
by the standard deviations of the parameters,
which are smaller in a well-constrained model.

Figure 29.4. Characteristics of correlations or covariances among parameters after data assimilation. (a) The number of
posterior parameter distributions that show significant (P < 0.01) correlations for different levels of correlation and different
numbers of constraining data sets. (b) The correlation matrix of model parameters for the model constrained by all available data
sets. The color scale represents the r² correlation between each pair of parameters. (c) The posterior parameter covariance (dots)
for different numbers of constraining data sets, normalized to the maximum total covariance observed. The line represents a
polynomial fit to the data. (d) The covariance matrix for the model parameters for the model constrained by all available data
sets. The color scale represents the covariance normalized to the maximum observed covariance value. Derived from Keenan
et al. (2013).

Correlation between posterior parameter values
may occur when high values of one parameter
(or a term scaled by it) can be compensated for
by low values of another parameter or term in the
same response function or process. For example,
soil heterotrophic respiration may be calculated as
the product of a basal respiration rate and a
temperature sensitivity term (i.e., Q10). While
both the basal respiration rate and the temperature
sensitivity of respiration may be informed by
measurements of soil respiration, the trade-off between
the two parameters allows the model to predict
similar values of soil respiration with many
anticorrelated combinations of the two parameters. In
other words, the posterior distributions of the two
parameters may be negatively correlated. In this
case, a direct measurement of one of the parameters
may constrain the other model parameter and
produce a more accurate model.
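The trade-off between the basal respiration rate and Q10 can be seen with two hand-picked parameter sets. The sketch below assumes the common form R = basal rate x Q10^((T - T_ref)/10) with a reference temperature of 10 degC. The function name and all parameter values are hypothetical.

```python
def soil_respiration(temp_c, basal_rate, q10, t_ref=10.0):
    """Heterotrophic respiration as a basal rate times a Q10 temperature modifier."""
    return basal_rate * q10 ** ((temp_c - t_ref) / 10.0)


# Two hypothetical parameter sets: a higher basal rate is offset by a lower Q10.
set_a = {"basal_rate": 1.00, "q10": 2.00}
set_b = {"basal_rate": 1.15, "q10": 1.75}

for temp_c in (0, 15, 20, 25):
    r_a = soil_respiration(temp_c, **set_a)
    r_b = soil_respiration(temp_c, **set_b)
    print(f"{temp_c:>2} degC: set A {r_a:.2f}, set B {r_b:.2f}, "
          f"relative difference {abs(r_a - r_b) / r_a:.0%}")
```

Between 15 and 25 degC the two sets give respiration rates within about 10% of each other, so soil respiration measured over that range would struggle to tell them apart. The sets diverge at 0 degC, which suggests that observations spanning a wider temperature range, or a direct measurement of either parameter, would help break the trade-off.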

Correlation between posterior parameter 
values may also occur when counteracting pro- 
cesses in a model cannot be distinguished by 
the data. For example, forest floor litter is usu- 
ally separated into metabolic litter and struc- 
tural litter. However, turnover rates of the two 
carbon pools may not be well informed given 
only litterfall and/or forest floor biomass mea- 
surements. This is because a fast turnover rate of 
one of the two litter pools may be compensated 
by a slow turnover rate of the other litter pool. In 
this case, measured pool size of either metabolic 
litter or structural litter may be helpful to con- 
strain the turnover rates of both litter pools. If no 
such measurements are available, a more parsi- 
monious model with the two litter pools binned 
together may be used to perform data assimila- 
tion and model predictions. 

Note that the absence of a correlation between posterior parameter values does not necessarily indicate the identifiability of model parameters. A correlation may be absent because of high-dimensional (i.e., more than two-dimensional) relationships among posterior values of model output variables, especially when most model parameters are poorly informed by data. For example, leaf biomass is positively correlated with litter biomass, which in turn is negatively correlated with litter turnover rate, but leaf biomass may not correlate with litter turnover rate. If litter biomass but not leaf biomass or litter turnover rate is informed by a dataset (e.g., litter biomass measurement), the relationship between leaf biomass and litter turnover rate may become positive. Indeed, Keenan et al. (2013) found that the number of correlated parameter pairs increased with the number of datasets used for data assimilation, up to six (Figure 29.4). This could be because when more datasets are used to constrain the model, the dimensionality of model parameters may decrease (Luo et al. 2009), resulting in more apparent correlations between some parameters. In the study of Keenan et al. (2013), using more datasets beyond these six did not significantly change parameter correlations (Figure 29.4), as the correlations seemed to be mainly determined by the structure of the model. 


PREDICTION OF LAND CARBON DYNAMICS 
AFTER DATA ASSIMILATION 


Model predictions can be improved by data assimilation. Data assimilation can reduce uncertainties in predicted land carbon dynamics during both hindcasts and forecasts. As with model parameters, uncertainties in predicted land carbon dynamics generally decrease with the use of more data, although some datasets reduce the uncertainties more than others. Assimilating data such as ten years of measurements of land carbon pools into a model can substantially reduce uncertainties in predicted land carbon pools, especially for fast-cycling carbon pools (e.g., foliage and fine roots). Dynamics of slow-cycling carbon pools (e.g., passive soil carbon pool) may not be well informed by short-term (i.e., < 10 years) measurements of pool size, because a slowly degrading pool would not be expected to exhibit much change over such a relatively short period of measurements. The ¹⁴C signature of soil carbon pools, which can indicate the age of carbon in soil, could provide a solution for informing the dynamics of slow-cycling carbon pools. Measurements of soil ¹⁴C signature and models that have incorporated such measurements are still uncommon, however. 

Data assimilation can reduce not only the uncertainty but also the bias of model predictions during forward runs. The impact of data assimilation on model predictions during forward runs may be evaluated by splitting data streams into two time periods. Data streams during one time period are used for model calibration and data assimilation, with the other time period used for model validation. The bias of model prediction may be assessed using root mean square error, which measures the difference between data and model predictions. When forecasting future land carbon dynamics, the maximum likelihood estimates of land carbon pools can be altered by data assimilation. The estimates after data assimilation are expected to be more reliable than those before data assimilation. If forecasts are made for the near term (e.g., days to years), they may be validated by future measurements as they become available. 
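
The split-sample evaluation described above can be sketched in a few lines. Everything numerical here is hypothetical: the yearly pool measurements, the split year, and the placeholder linear "model" stand in for a calibrated carbon model and are not from the chapter. The sketch only shows how a data stream might be divided into calibration and validation periods and how the root mean square error is computed on the validation part.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between paired predictions and observations."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def split_by_time(times, values, split_time):
    """Split a data stream into a calibration part (before split_time)
    and a validation part (at or after split_time)."""
    calib = [(t, v) for t, v in zip(times, values) if t < split_time]
    valid = [(t, v) for t, v in zip(times, values) if t >= split_time]
    return calib, valid

# Hypothetical yearly carbon pool measurements (arbitrary units).
years = [2001, 2002, 2003, 2004, 2005, 2006]
pool = [10.0, 10.4, 10.9, 11.1, 11.6, 12.0]
calib, valid = split_by_time(years, pool, split_time=2004)

# Placeholder "model": a linear trend standing in for a calibrated model.
def model(year):
    return 10.0 + 0.45 * (year - 2001)

valid_rmse = rmse([model(t) for t, _ in valid], [v for _, v in valid])
print(f"validation RMSE = {valid_rmse:.3f}")
```

Because the validation years play no part in calibration, a low validation RMSE is evidence that the model captures the signal rather than the noise of the calibration period.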


SUMMARY 


Data assimilation is useful for extracting information from models and data. The information content of models and data can be quantified using the Shannon information index. A good forward model is fundamental to long-term predictions of land carbon dynamics. Data are particularly important in constraining predictions of short-term land carbon dynamics. While the amount of information generally increases with the amount of data, some types of data are more useful than others in informing models. When some model parameters cannot be constrained by available measurements, this suggests the need for new types of measurements or a simpler model. Overall, data assimilation exercises can aid model development and guide data collection to constrain model predictions of land carbon dynamics. 
QUIZZES 


1. What kind of metric can be used to quantify the 
information content? 


2. How does the contribution of a model to pro- 
jected carbon pool sizes change with projection 
time? 

3. Do more land carbon measurements mean more 
information on model predictions on land car- 
bon dynamics? 


4. How can correlations and covariances between 
posterior values of parameters be used to explore 
model equifinality? 


5. How can data assimilation help model predic- 
tions of land carbon dynamics? 
This chapter demonstrates how mathematical models and data-model fusion can be used to identify the processes controlling carbon (C) dynamics. A suite of process-based models was built to describe the C dynamics in a temperate lake. The models were built so as to create a factorial modeling experiment aiming to identify the processes that contributed most to explaining the variation in the observed data. All models were calibrated using the Bayesian Markov Chain Monte Carlo algorithm, and their fit indices were used to identify the processes that contributed most to model improvement. Although the example used in this chapter focuses on C dynamics in a temperate lake, the presented hypothesis-testing algorithm is transferable to other response variables and ecosystems. 


MODELS AND DATA-MODEL FUSION 


Ecosystem models are useful tools for exploring how sensitive a variable of interest is to changes in environmental conditions as well as for projecting its state into the future. A process-based model is a mathematical expression of our theoretical knowledge that was initially obtained through observational and manipulative experiments. Models are typically developed in three stages. The first stage is model formulation: we express the theoretical knowledge about the dynamics of the response variables as a system of mathematical equations. The second stage is model calibration: the parameters in the mathematical equations are tuned so as to improve representation of the observations. The initial values of the parameters, which are tuned up or down in the calibration stage, are typically obtained from the literature or prior experimental data. The last stage is model validation: model performance is evaluated against a set of observations that were not used for model calibration. 

We may run into issues during the last stage of model development: our model may not perform well when evaluated against the observations. When this happens, one may be tempted to attribute the poor model performance to inaccurate representation of the processes controlling the dynamics of the response variable, and to remedy it by adding more processes to the model. However, adding new processes increases model complexity and may lead to overfitting, i.e., capturing the noise in the observations and not the signal. In addition, an increase in the number of model parameters (which is an inevitable consequence of increasing model complexity) may inflate the uncertainty in the model predictions, thus reducing the model's utility. 

To reduce the risk of overfitting and uncertainty inflation, it is important to implement the second stage of the model development process, model calibration, by using a data-model fusion (or data assimilation) approach. A data-model fusion algorithm, such as 3DVAR, 4DVAR (Lewis et al., 2006), Ensemble Kalman Filter, or Bayesian inversion (Evensen, 2009), finds parameter values that either minimize a least-squares cost function or maximize the likelihood function. In other words, these algorithms find the values of model parameters that result in the closest match between the model output and observations. Because these algorithms for model calibration are not restricted by the model structure, they have been adopted by atmospheric scientists, hydrologists, and biogeoscientists as a formal model calibration approach. 
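
As a minimal illustration of calibration by least squares, the sketch below fits the decay rate of a one-pool, first-order decay model to synthetic data by grid search. The model, the data, the search range, and the function names are assumptions made for this example; the algorithms named above (3DVAR, 4DVAR, Ensemble Kalman Filter, Bayesian inversion) are far more capable, but the objective is the same: find the parameter value that brings model output closest to the observations.

```python
import math

def decay_model(k, x0, times):
    """First-order decay of a single pool: X(t) = X0 * exp(-k t)."""
    return [x0 * math.exp(-k * t) for t in times]

def sum_of_squares(k, x0, times, obs):
    """Least-squares cost: squared mismatch between model output and data."""
    pred = decay_model(k, x0, times)
    return sum((p - o) ** 2 for p, o in zip(pred, obs))

def calibrate_k(x0, times, obs, k_min=0.0, k_max=1.0, steps=1000):
    """Grid search for the decay rate that minimizes the least-squares cost."""
    candidates = [k_min + (k_max - k_min) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda k: sum_of_squares(k, x0, times, obs))

times = [0, 1, 2, 3, 4, 5]
obs = decay_model(0.3, 100.0, times)        # synthetic data, true k = 0.3
k_hat = calibrate_k(100.0, times, obs)
print(f"calibrated k = {k_hat:.3f}")
```

Nothing in the search depends on the internal structure of `decay_model`, which is the sense in which such calibration methods are not restricted by model structure.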

A model calibrated using a data-model fusion approach performs to the best of its ability. Therefore, if a calibrated model still performs poorly, the poor performance can be attributed to the fact that the chosen model structure is not optimal for the given system. This property of data-model fusion provides grounds for accepting or rejecting a structural model hypothesis based on its performance, thus opening an avenue for learning about the processes that best describe a system (Figure 30.1). So, if we have a number of competing hypotheses about the processes controlling the dynamics of our response variable, we can build a suite of mathematical models that represent these processes, calibrate them against the observations, and evaluate which process representations (and the hypotheses they entail) consistently improve model performance. Such a multiple-hypothesis testing framework aligns with the principles of “strong inference” (Platt, 1964), which have led to rapid progress in science, particularly in the field of molecular biology. 

Figure 30.1. Conceptual representation of model-based hypothesis testing. Mechanistic hypotheses about the processes within an ecosystem are expressed as a process-based model; model parameters are calibrated using a data assimilation technique and a portion of observations reserved for calibration; afterwards the calibrated model is evaluated on the portion of observations reserved for validation, and the fit index is used to accept or reject the mechanistic hypothesis that was represented in the model. 

In this chapter, we study an example of imple- 
menting a multiple hypothesis testing approach 
to learn about the mechanisms controlling the 
dynamics of epilimnetic carbon (C) in Long Lake, 
a temperate lake located at the University of Notre 
Dame Environmental Research Center. 


PROCESSES THAT MAY CONTROL EPILIMNETIC 
C DYNAMICS 


To start, let us define the general model structure and explore how it can be modified to test various structural model hypotheses. There are two forms of C in the epilimnion, or the portion of the lake above the thermocline: dissolved organic C (DOC) and dissolved CO₂. Similar to the dynamics of the C pools in terrestrial ecosystems (see Chapter 1 “Theoretical Foundation of the Land Carbon Cycle and Matrix Approach”), dynamics of these two aquatic C pools can be expressed as a system of first-order linear differential equations: 

$$\frac{dX(t)}{dt} = I(t) - A(t)\,X(t) \tag{30.1}$$

where X(t) is a vector of epilimnetic DOC and CO₂ pools at the time t; I(t) is the vector of inputs to the DOC and CO₂ pools, which includes stream, precipitation, and overland flow; and A(t) is the C loss and transfer matrix. The A(t) matrix combines the effects of biotic and abiotic factors on DOC mineralization; hydrologic C export, or loss of DOC and CO₂ with outflow; atmospheric exchanges; and fluxes to or from the hypolimnion, or the portion of the lake below the thermocline. 
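
To make the pool-and-flux bookkeeping concrete, the sketch below integrates a two-pool system of the form dX/dt = I − A X with forward-Euler steps. It assumes that form for Equation 30.1 and uses constant, hypothetical rates and inputs (`q_out`, `k`, `evasion`, and the loads are illustrative values, not estimates for Long Lake); in the chapter's models these terms vary in time.

```python
def step(x, inputs, a, dt):
    """One forward-Euler step of dX/dt = I - A X for a small dense system."""
    n = len(x)
    dxdt = [inputs[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [x[i] + dt * dxdt[i] for i in range(n)]

# Hypothetical constant rates (per day) and inputs (g C per m2 per day).
q_out, k, evasion = 0.01, 0.005, 0.1
inputs = [1.0, 0.2]                        # loads to the DOC and CO2 pools
a = [[q_out + k, 0.0],                     # DOC: lost to outflow and mineralization
     [-k, q_out + evasion]]                # CO2: gains mineralized DOC, loses to
                                           # outflow and the atmosphere

x = [0.0, 0.0]                             # [DOC, CO2], starting from empty pools
for _ in range(5000):                      # 5000 daily steps
    x = step(x, inputs, a, dt=1.0)

# Analytical steady state (A X* = I) for comparison.
doc_ss = inputs[0] / (q_out + k)
co2_ss = (inputs[1] + k * doc_ss) / (q_out + evasion)
print(f"DOC = {x[0]:.2f} (steady state {doc_ss:.2f}), "
      f"CO2 = {x[1]:.2f} (steady state {co2_ss:.2f})")
```

The negative off-diagonal entry of `a` is what transfers mineralized DOC into the CO₂ pool; adding a pool or a process amounts to adding rows, columns, or terms to `inputs` and `a`.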

The model structural assumptions (or hypo- 
theses) that are tested here affect the size of X(t), 
and the elements of I(t) and A(t) (Figure 30.2). 
The evaluated model structural hypotheses are 
listed below. These hypotheses and subhypotheses 
are combined so as to create a factorial experi- 
mental design, yielding 40 model variants in total. 
Once the model variants are calibrated against the 
Figure 30.2. Diagram depicting a subset of the ensemble of model formulations (represented as terminal nodes). Each model 
formulation is calibrated against the time series of DOC and CO₂, and the processes that improve the model fit are identified via 
modeling the fit indices as a function of presence or absence of a process. PO: model variant with photooxidation; DC: model 
variant with representation of density currents; DG: model variant with the representation of the depth gradient in hypolimnetic 
DOC and CO₂. 


HYPOTHESIS 1 


Accounting for varying DOC lability improves model performance. Even though DOC may be treated as a labile form of organic matter in terrestrial ecosystems, incubation studies have indicated a range of lability of DOC in aquatic ecosystems. The model variant that represents heterogeneity in lability of DOC includes two DOC pools: a labile and a recalcitrant one. To incorporate two DOC pools, a third row is added to the X(t) vector as well as the I(t) vector and A(t) matrix from Equation 30.1: 

$$\frac{d}{dt}\begin{bmatrix} \mathrm{DOC}_e^{l}(t) \\ \mathrm{DOC}_e^{r}(t) \\ \mathrm{DIC}_e(t) \end{bmatrix} = \begin{bmatrix} I_{\mathrm{DOC}}^{l}(t) \\ I_{\mathrm{DOC}}^{r}(t) \\ I_{\mathrm{DIC}}(t) \end{bmatrix} - \begin{bmatrix} Q_{out}(t) + k_l & 0 & 0 \\ 0 & Q_{out}(t) + k_r & 0 \\ -k_l & -k_r & Q_{out}(t) + evasion(t) \end{bmatrix} \begin{bmatrix} \mathrm{DOC}_e^{l}(t) \\ \mathrm{DOC}_e^{r}(t) \\ \mathrm{DIC}_e(t) \end{bmatrix} \tag{30.2}$$

where subscripts or superscripts l and r denote processes affecting labile or recalcitrant DOC; subscript “e” denotes epilimnetic C pools; $Q_{out}(t)$ is total water discharge out of the lake via groundwater and outlet at the time t; $k_l$ is the mineralization rate of the labile DOC; $k_r$ is the mineralization rate of the recalcitrant DOC; and evasion(t) is the outgassing of DIC to the atmosphere, calculated as $evasion(t) = k\mathrm{CO_2}(t)/z_{thermo}$, where $k\mathrm{CO_2}(t)$ is the gas transfer rate for CO₂ in m day⁻¹ and $z_{thermo}$ is the average daily thermocline depth in m. 
HYPOTHESIS 2 


Representing gravity-driven density currents in the lake improves model performance. Long Lake is fed by a stream, which also delivers DOC and CO₂ into the lake. Depending on the concentration of the solutes in the stream water and its temperature, its density may be higher or lower compared to that of the epilimnion. If stream water density is lower than or equal to that of the epilimnetic water, the solutes will be delivered to the epilimnion. However, if stream water density is higher compared to that in the epilimnion, the solutes will sink to the hypolimnion. In the latter case, incoming DOC and CO₂ will essentially be locked in the lake, whereas in the former case CO₂ will escape to the atmosphere (given that the water is supersaturated with CO₂), and DOC will be transformed into CO₂, which, in turn, will escape to the atmosphere. To test this hypothesis, elements of I(t) are modified by a function f(Δρ), which allocates less stream C to the epilimnion as the water density gradient Δρ increases: 

$$\frac{d}{dt}\begin{bmatrix} \mathrm{DOC}_e(t) \\ \mathrm{DIC}_e(t) \end{bmatrix} = f[\Delta\rho(t)] \begin{bmatrix} I_{\mathrm{DOC}}(t) \\ I_{\mathrm{DIC}}(t) \end{bmatrix} - \begin{bmatrix} Q_{out}(t) + k & 0 \\ -k & Q_{out}(t) + evasion(t) \end{bmatrix} \begin{bmatrix} \mathrm{DOC}_e(t) \\ \mathrm{DIC}_e(t) \end{bmatrix} \tag{30.3}$$

where Δρ(t) is the difference between stream and epilimnion densities at time t; $k_{ins}$, the parameter of f, is the sensitivity of epilimnion DOC and DIC load to differences between stream and epilimnetic water densities; and k is the DOC decay rate. 


HYPOTHESIS 3 


Representing photooxidation of DOC improves model performance. Water incubation studies have shown that DOC is depleted at a higher rate in the presence of ultraviolet (UV) light compared to that in the dark incubations. Indeed, UV light may aid in oxidation of organic molecules and break covalent bonds in large organic molecules, thus increasing their susceptibility to microbial uptake and mineralization. Here, the effect of photooxidation is tested through modifying the elements of A(t) by an empirical function of photosynthetically active radiation, f(PAR): 

$$\frac{d}{dt}\begin{bmatrix} \mathrm{DOC}_e(t) \\ \mathrm{DIC}_e(t) \end{bmatrix} = \begin{bmatrix} I_{\mathrm{DOC}}(t) \\ I_{\mathrm{DIC}}(t) \end{bmatrix} - \begin{bmatrix} Q_{out}(t) + k + f(\mathrm{PAR}) & 0 \\ -k - f(\mathrm{PAR}) & Q_{out}(t) + evasion(t) \end{bmatrix} \begin{bmatrix} \mathrm{DOC}_e(t) \\ \mathrm{DIC}_e(t) \end{bmatrix} \tag{30.4}$$

where f(PAR) is an empirical function of PAR(t) with two parameters: $k_{base,photo}$, the base photodegradation rate, and $k_{photo}$, the sensitivity of DOC degradation to PAR; PAR(t) is photosynthetically active radiation at the time t in mmol photons m⁻² day⁻¹. 

HYPOTHESIS 4 


Representing entrainment of hypolimnetic water and a depth gradient in hypolimnetic CO₂ and DOC concentrations improves model performance. When the thermocline deepens, some hypolimnetic water moves into the epilimnion. The hypolimnetic water is an additional source of CO₂ and DOC that is not accounted for by the default elements of the I(t) vector in Equation 30.1. In addition, entrainment can be a significant source of nutrients to the epilimnion and enhance nutrient-limited productivity and DOC decay rates. The distribution of DOC and CO₂ in the hypolimnion may not be uniform across depths, which can be treated as two additional sub-hypotheses of Hypothesis 4. For instance, CO₂ concentrations tend to be higher at greater depths in hypoxic lakes. If DOC and CO₂ concentrations vary across depths, the entrainment fluxes of C into the epilimnion may be biased, and explicitly representing the nonuniformity of concentrations of these solutes can correct the bias. To test the entrainment and nonuniformity of C concentrations hypotheses, elements of I(t) were modified so as to include entrainment of the hypolimnetic water that resulted from variations in thermocline depth. For detailed equations as well as R scripts implementing this structural hypothesis, please see the suggested reading. 


observations, their performance is evaluated and 
the processes that consistently improve model per- 
formance are identified. 


MODEL CALIBRATION AND SELECTION 


To ensure that each model variant performs to the best of its ability, it needs to be calibrated using a formal data-model fusion approach. The parameters in the models are calibrated using a Bayesian Markov Chain Monte Carlo (MCMC) technique, which is a variant of Bayesian inversion. Mathematically, Bayesian inversion can be summarized as: 

$$p(c \mid Z) = \nu_e \times p(Z \mid c) \times p(c) \tag{30.5}$$

where p(c|Z) is the posterior probability density of parameters c; p(Z|c) is the likelihood function of parameters c; p(c) is the prior probability density of parameters c; and $\nu_e$ is a normalization constant. If the errors are assumed to be normally distributed, the likelihood function can be expressed as: 

$$p(Z \mid c) = \nu_l \times \exp\left\{ -\sum_{j=1}^{2} \sum_{i=1}^{N} \frac{\left(Z_{i,j} - X_{i,j}\right)^2}{2\sigma_{i,j}^2} \right\} \tag{30.6}$$

where $Z_{i,j}$ is the jth observation type at the ith time point (there were two observation types: DOC and CO₂ pools); $X_{i,j}$ is the model output for the jth observation type at the ith time point; N is the total number of observations of the jth type; $\sigma_{i,j}^2$ is the variance of the jth observation type at the ith time point; and $\nu_l$ is a constant. There are two observation types used in this study: (1) a two-year time series of the epilimnetic DOC pool; and (2) a two-year time series of the epilimnetic CO₂ pool. These observations are split into two groups: the measurements collected in the first year are used to calibrate the model parameters, and the measurements collected in the second year are set aside for use in evaluating the performance of the models with calibrated parameters. Such a design helps control for overfitting, ensuring that the best-performing models capture the signal rather than the noise in the observations. 
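
In practice the likelihood is evaluated on the log scale to avoid numerical underflow. The sketch below computes the logarithm of a Gaussian likelihood of the kind described above, up to an additive constant. It assumes the common form in which each squared residual is divided by twice its variance; the dictionary layout, the function name, and all numbers are hypothetical and serve only as an illustration.

```python
import math

def log_likelihood(obs, pred, variances):
    """Log of a Gaussian likelihood, up to an additive constant.

    obs, pred, variances: dicts keyed by observation type (e.g. "DOC", "CO2"),
    each holding a list with one entry per time point.
    """
    total = 0.0
    for kind in obs:                                   # sum over observation types j
        for z, x, var in zip(obs[kind], pred[kind], variances[kind]):
            total -= (z - x) ** 2 / (2.0 * var)        # sum over time points i
    return total

# Hypothetical numbers, for illustration only.
obs = {"DOC": [5.0, 5.5, 6.0], "CO2": [1.0, 1.2, 0.9]}
pred = {"DOC": [5.1, 5.4, 6.3], "CO2": [1.1, 1.1, 1.0]}
var = {"DOC": [0.25] * 3, "CO2": [0.04] * 3}

ll = log_likelihood(obs, pred, var)
print(f"log-likelihood (up to a constant) = {ll:.3f}")
```

Dividing by the variance means that a residual of a given size counts for more when the observation is precise, which is how data streams with different units and uncertainties can be combined in a single cost.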

The prior distributions for the parameters (p(c)) are assumed to be uniform. The posterior parameter distributions p(c|Z) are created by sequentially proposing parameters that are randomly sampled from their prior distributions and accepting or rejecting them using the Metropolis criterion (Spall, 2005). The parameters in the posterior distributions may be correlated, which often reduces the acceptance rate of the proposed parameters. To account for the correlations among the parameters and increase the parameter acceptance rate, the proposal distribution is set to uniform only for the first 10,000 iterations (this number of iterations yielded an optimal parameter acceptance rate here; however, the reader can experiment with the initial number of iterations). Afterwards, the accepted parameters are used to estimate the parameter covariance matrix, and the proposal distribution for the next 490,000 iterations is switched to the multivariate normal: 

$$c_{new} \sim N\left(c_{k-1}, C_k\right) \tag{30.7}$$

where $c_{new}$ is the proposed parameter vector at iteration k, $c_{k-1}$ is the parameter vector from the previous iteration, and $C_k$ is defined as 

$$C_k = \begin{cases} C_0, & k = 10{,}000 \\ s_d\,\mathrm{cov}\left(c_0, \ldots, c_{k-1}\right), & k > 10{,}000 \end{cases} \tag{30.8}$$

where $s_d = 2.38/\sqrt{n}$; n is the number of parameters in a model variant; and $C_0$ is the parameter covariance matrix constructed from the first 10,000 iterations. Equations 30.7 and 30.8 summarize the parameter sampling technique in the Adaptive Metropolis algorithm (Haario et al., 2001), which is aimed at optimizing the parameter acceptance rate. It is important to monitor the parameter acceptance rates and keep them between 23% and 44%, because (1) high acceptance rates lead to misrepresentation of the tails of the posterior parameter distributions, and (2) low parameter acceptance rates may lead to a computationally infeasible increase in the total number of simulations needed to obtain a sufficient number of samples for constructing the posterior parameter distributions (Gelman et al., 1996). The first half of the accepted parameters is considered to be a “burn-in” period and should be discarded to avoid using model parameters from a nonstationary posterior distribution to calculate maximum likelihood parameter estimates. The second half of the accepted parameters can be used to generate the marginal posterior parameter distributions and calculate the maximum likelihood parameter estimates. Lastly, the model output produced with the maximum likelihood parameter estimates is used to calculate the model fit index, which in turn will help evaluate which model structural assumptions consistently improve model performance. 
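
The two-stage sampling scheme described above can be sketched compactly. The code below is an illustrative sketch under several assumptions and is not the chapter's implementation: the target is a hypothetical two-parameter log-likelihood rather than a lake model, the stage lengths are much shorter than 10,000 and 490,000, the proposal covariance is refreshed every 500 iterations rather than at every step, a tiny value is added to the diagonal of the covariance for numerical stability, and the names (`adaptive_metropolis`, `log_like`, and so on) are invented for this example.

```python
import math
import random

def covariance(samples):
    """Sample covariance matrix of a list of parameter vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]

def cholesky(m):
    """Lower-triangular factor L with L L^T = m (m symmetric positive definite)."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def adaptive_metropolis(log_like, bounds, n_uniform, n_adaptive, seed=0):
    """Two-stage sampler: uniform proposals, then adaptive normal proposals."""
    rng = random.Random(seed)
    d = len(bounds)
    s_d = 2.38 / math.sqrt(d)

    def log_post(c):
        # Uniform prior: constant inside the bounds, zero probability outside.
        inside = all(lo <= v <= hi for v, (lo, hi) in zip(c, bounds))
        return log_like(c) if inside else -math.inf

    current = [rng.uniform(lo, hi) for lo, hi in bounds]
    current_lp = log_post(current)
    chain, n_accepted, chol = [], 0, None
    for k in range(n_uniform + n_adaptive):
        if k < n_uniform:
            # Stage 1: propose from the uniform prior.
            proposal = [rng.uniform(lo, hi) for lo, hi in bounds]
        else:
            if chol is None or k % 500 == 0:
                # Stage 2: refresh the proposal covariance from the chain so far.
                cov = covariance(chain)
                scaled = [[s_d * cov[i][j] + (1e-10 if i == j else 0.0)
                           for j in range(d)] for i in range(d)]
                chol = cholesky(scaled)
            z = [rng.gauss(0.0, 1.0) for _ in range(d)]
            proposal = [current[i] + sum(chol[i][j] * z[j] for j in range(i + 1))
                        for i in range(d)]
        lp = log_post(proposal)
        # Metropolis criterion on the log scale; 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < lp - current_lp:
            current, current_lp = proposal, lp
            n_accepted += 1
        chain.append(list(current))
    return chain, n_accepted / len(chain)

# Hypothetical two-parameter target with a strong negative correlation:
# the "data" constrain c0 + c1 tightly but c0 - c1 only weakly.
def log_like(c):
    return -((c[0] + c[1] - 3.0) ** 2) / (2 * 0.1 ** 2) - ((c[0] - c[1]) ** 2) / 2.0

chain, acceptance_rate = adaptive_metropolis(
    log_like, bounds=[(0.0, 5.0), (0.0, 5.0)],
    n_uniform=2000, n_adaptive=4000, seed=1)
kept = chain[len(chain) // 2:]                 # discard the first half as burn-in
mean_sum = sum(c[0] + c[1] for c in kept) / len(kept)
print(f"acceptance rate = {acceptance_rate:.2f}, mean(c0 + c1) = {mean_sum:.2f}")
```

Because the proposal covariance is learned from the chain, proposals in the second stage are stretched along the direction in which the two parameters trade off, which is what raises the acceptance rate relative to uniform proposals. The reported rate is over both stages combined; monitoring it separately for the adaptive stage would be closer to the guidance given above.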

The model fit index used in this chapter is the Akaike information criterion (AIC; Akaike, 1974), which is calculated as: 

$$\mathrm{AIC} = 2n - 2\ln(L) \tag{30.9}$$

where n is the number of parameters in a model variant and L is the likelihood function (p(Z|c) from Equation 30.5) calculated using the portion of observations reserved for model validation. With 102 model formulations, identifying processes that consistently improve model performance is not trivial. However, if we treat AIC as a function of presence or absence of a particular process in the model (i.e., use binary predictors to model AIC), we can test the significance of the coefficients associated with the presence of a particular process. If the coefficients are significantly smaller than 0, representing a process in the model improves its performance. Such analysis essentially performs a linear regression on the AICs. However, a linear regression model has to satisfy a suite of assumptions, such as linearity of the relationship and homogeneity of residual variance. Therefore, a nonparametric method for identifying the processes that improve model performance may be preferable. 
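
The idea of treating AIC as a function of binary process indicators can be shown with a toy factorial. In the sketch below, the two processes, the log-likelihoods, and the parameter counts are hypothetical. For a balanced full-factorial set of variants, the difference between the mean AIC with and without a process equals the ordinary least-squares coefficient of that process's binary predictor in a main-effects regression, so the sketch computes that difference directly instead of fitting a regression; it does not test significance.

```python
from itertools import product

def aic(n_params, log_likelihood):
    """Akaike information criterion: AIC = 2n - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def process_effects(variants, aics, processes):
    """Mean AIC with a process minus mean AIC without it, per process."""
    effects = {}
    for p in processes:
        present = [a for v, a in zip(variants, aics) if v[p]]
        absent = [a for v, a in zip(variants, aics) if not v[p]]
        effects[p] = sum(present) / len(present) - sum(absent) / len(absent)
    return effects

processes = ["entrainment", "photooxidation"]
variants = [dict(zip(processes, flags)) for flags in product([0, 1], repeat=2)]

# Hypothetical validation log-likelihoods and parameter counts per variant,
# keyed by (entrainment, photooxidation).
log_liks = {(0, 0): -60.0, (0, 1): -60.5, (1, 0): -48.0, (1, 1): -48.5}
n_params = {(0, 0): 3, (0, 1): 5, (1, 0): 4, (1, 1): 6}

aics = []
for v in variants:
    key = (v["entrainment"], v["photooxidation"])
    aics.append(aic(n_params[key], log_liks[key]))

effects = process_effects(variants, aics, processes)
for name, effect in effects.items():
    print(f"{name}: change in mean AIC = {effect:+.1f}")
```

In this made-up example, the first process lowers the mean AIC substantially, whereas the second raises it slightly because its extra parameters are not repaid by a better fit.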

Here, we will use a machine learning technique, recursive partitioning by conditional inference (Hothorn et al., 2006), to identify which processes consistently reduce the AIC values, thus indicating a consistent improvement in model performance. Unlike many recursive partitioning methods, conditional inference trees use statistical significance tests to decide whether to split, which guards against overfitting, and they are not biased towards covariates with many possible splits. This method is available in the R “partykit” package (Hothorn et al., 2015), which is used here to analyze the AIC values. 


PROCESSES THAT CONTROL EPILIMNETIC C 
DYNAMICS 


Multiple hypothesis testing revealed that representing physical controls of epilimnetic C dynamics consistently improved model performance, whereas refining the effect of biological controls did not affect it significantly (Figure 30.3). The conditional inference machine learning algorithm identified a cluster of twenty models that had significantly lower AIC values compared to other model variants. All these twenty models represented entrainment of hypolimnetic water, a depth gradient in hypolimnetic CO₂ content, and density-dependent partitioning of the DOC and CO₂ load between the epi- and hypolimnion. Notably, representing heterogeneity in DOC lability neither improved nor worsened model performance, which was also true for photooxidation of DOC. The lack of sensitivity of model performance to these structural assumptions likely indicates that the observations do not contain information that could be used to evaluate these hypotheses. In summary, multiple hypothesis testing led to accepting hypotheses #2 and #4. 

Figure 30.3. Model structural assumptions that significantly reduced AIC. “Yes” and “no” indicate presence or absence of a process or assumption in a model. Thick gray lines highlight a path to a cluster of best-performing models (Node 9). ∇ in hypoDIC denotes the assumption that hypolimnetic CO₂ concentrations were nonuniform across depth and by default includes the representation of entrainment in model variants; load × f[Δρ(t)] indicates that stream loads of DOC and CO₂ are partitioned between the epilimnion and the hypolimnion as a function of the water density difference between the stream and the epilimnion. 

The question that remains is: how well do the best-performing models represent the observed variation in epilimnetic C dynamics? Since the performance of the top 20 models in simulating temporal variation in lake C pools did not vary substantially, we include the illustration of performance of only one model (Figure 30.4). The chosen model variant does not represent photooxidation and heterogeneity in DOC lability. The model explained a substantial portion of the temporal variability in the epilimnetic CO₂ and DOC pools in the validation data set (81% and 64%, respectively; Figure 30.4), and most observations fell within two standard deviations of the mean predicted value. Thus, model-data fusion combined with a multiple hypothesis testing approach helped build a well-performing model for simulating epilimnetic C dynamics. 

Figure 30.4. Performance of the best models in representing epilimnetic CO₂ and DOC observations that were not used for model calibration. Gray shaded region represents 2 standard deviations from the model mean; black dots are observed data points. Best-performing models explained 81% of temporal variation in the epilimnetic CO₂ pool and 64% in the DOC pool. Although there were 20 model variants that were “best-performing” (Figure 30.3), their representation of the two observed C pools did not vary substantially; therefore performance of only one model is illustrated. 

Although the example in this chapter focuses on lake biogeochemistry, the general framework of model-based multiple hypothesis testing could potentially be applied in terrestrial biogeochemistry or many other fields, as it is not limited to a particular model structure. Provided that the available observations are informative for the hypotheses being tested, much knowledge can be gained about mechanistic controls of a variable of interest. The results of the analysis described here may also indicate that the observations are noninformative for evaluating whether a particular mechanism is important, which may manifest as a lack of response of the fit index to the additional processes. Such negative results may provide guidance for future experimental design so as to maximize information content in the collected observations. 
QUIZZES 


1. What are the three stages of model development? 

2. What is the objective of data-model fusion? 

3. How does data-model fusion facilitate strong 
inference? 

4. What are the criteria to accept or reject a 
hypothesis? 
Large model uncertainty in projected future soil 
carbon (C) dynamics has been well documented. 
However, our understanding of the sources of this 
uncertainty is limited. This chapter is designed to 
illustrate the projection uncertainties induced by 
model structures and parameter values. Three rep- 
resentative soil carbon models are compared in 
terms of their predictions of soil carbon change in 
a future climate warming scenario. The parameter 
values in each model were derived by fitting mod- 
eled soil carbon to observations. We see that uncer- 
tainty often increases with complexity of a model. 
The larger uncertainty in the complex models 
suggests that we need to strike a balance between 
model complexity and the need to include diverse 
model structures in order to forecast soil C dynam- 
ics with high confidence and low uncertainty. 


INTRODUCTION 


The structure and parameter values of a model 
together determine its predictions, and both contrib- 
ute to model uncertainty. Differences in structure and 
parameters among global land carbon models give 
rise to projection uncertainty, which is typically high 
across models in future scenario studies. Past studies 
have linked different model structures to differences 
in soil C projections (Wieder et al., 2013), and parameterization (i.e., the choice of parameter values) has been shown to cause large uncertainty in projected change in soil C (Hararuk et al., 2015). Parameterization is therefore likely to interact with model structure to affect a model's projections. 

Soil C decomposition models, in particular, are 
becoming more diverse. For example, recent inno- 
vations include microbial models that simulate 
decomposition processes with explicit microbial 
traits, and models with dynamics that explicitly 
account for interactions between soil layers at dif- 
ferent depths (Koven et al., 2013; Wieder et al., 
2015). Exploring uncertainty generated by model 
structures and parameterization is critical for 
global land C modeling, but until now, relatively 
little effort has been dedicated to addressing it, in 
part due to enormous computational cost. 

Here we demonstrate how a Bayesian framework, in which the posterior parameter distributions are sampled via a Markov Chain Monte Carlo (MCMC) technique, can overcome the computational hurdle and generate a large ensemble of potential parameter sets within reasonable physical and biological boundaries. Applied to the soil carbon system, the approach relies on the simplifying assumption that the constituent SOM pools are at steady state. This ensemble generates current global soil carbon content similar to that in the Harmonized World Soil Database (HWSD) and Northern Circumpolar Soil Carbon Database (NCSCD). We choose three representative soil C decomposition models with distinct structures, suitable for application across a global grid (Figure 31.1). 

Figure 31.1. Soil carbon decomposition models with distinct structures. (a) Conventional Century-type model of CLM 4.0; (b) Vertically-resolved model in CLM 4.5bgc; (c) Microbial-Mineral Carbon Stabilization (MIMICS), which represents microbial biomass explicitly. MIC: microbial biomass; SOM: soil organic matter. 

The three models include the conventional 
soil C decomposition model embedded in the 
Community Land Model version 4.0 (CLM 4.0), 
a soil C decomposition model with explicit soil 
depth embedded in CLM 4.5 (Koven et al., 2013) 
and a microbial model (the MIcrobial-MIneral 
Carbon Stabilization: MIMICS; Wieder et al., 2015). 
A projection based on the RCP (Representative 
Concentration Pathway) 8.5 emissions scenario 
was carried out on a global grid for all the three 
models with their posterior parameter ensembles. 


ALTERNATIVE MODEL STRUCTURES 


The conventional model (Figure 31.1a) represents 
soil C decomposition using three C pools and can 
be written in a matrix form (Chapter 5) as: 


$$\frac{dX(t)}{dt} = I + A\,\xi(t)\,K\,X(t) \tag{31.1}$$


$I = [i_1, i_2, i_3]$ is the litter input to the three soil
carbon pools (labile, slow and passive soil C).

$$A = \begin{pmatrix} -1 & f_{12} & f_{13} \\ f_{21} & -1 & 0 \\ f_{31} & f_{32} & -1 \end{pmatrix}$$

is a transfer matrix among soil C pools, where $f_{21}$ is derived using
the equation $f_{21} = 1 - a - f_{31}$, where $a = a_1 - a_2 \times (1 - \mathrm{sand\%})$
and sand% is the sand proportion. The respiration coefficients can be
derived as one minus the fractions $f_{ij}$ transferred out of each carbon
pool. $K = [k_1, k_2, k_3]$ is the baseline turnover rate (yr$^{-1}$) of the
soil C pools. These baseline rates are modified by the environmental
modifier, $\xi$, which is a product of temperature scalar, soil moisture
scalar, soil nitrogen scalar and oxygen scalar. $X = [x_1, x_2, x_3]$ is
the time-varying soil C content in each of the three pools, the state
variables of the model. To be consistent with the other two models, we
only calculate the temperature scalar, using the equation
$Q_{10}^{(T_{soil} - 25)/10}$, where $Q_{10}$ is the temperature sensitivity
of decomposition and $T_{soil}$ is soil temperature. The rest of the
environmental scalars are set to their average across the ten soil layers
from CLM 4.5. There are in total ten global parameters (f's, k's,
$a_1$, $a_2$ and $Q_{10}$) in the conventional model.
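As an illustration of how the matrix form is used under the steady-state assumption, setting the time derivative in Equation 31.1 to zero gives a linear system for the pool sizes. The sketch below is a minimal example of that step only; the function names and all numerical values are hypothetical and chosen purely for illustration, not taken from CLM 4.0.

```python
import numpy as np

def transfer_matrix(f):
    """Build the 3 x 3 transfer matrix A of the conventional model."""
    return np.array([
        [-1.0,      f["f12"], f["f13"]],
        [f["f21"], -1.0,      0.0],
        [f["f31"],  f["f32"], -1.0],
    ])

def steady_state_pools(inputs, f, k, xi=1.0):
    """Solve 0 = I + A * xi * K * X for the pool sizes X."""
    A = transfer_matrix(f)
    K = np.diag(k)
    # Rearranged: (-xi * A K) X = I
    return np.linalg.solve(-xi * A @ K, np.asarray(inputs, dtype=float))

# Hypothetical values for illustration only.
f = {"f12": 0.4, "f13": 0.4, "f21": 0.3, "f31": 0.01, "f32": 0.03}
k = [2.0, 0.1, 0.005]   # baseline turnover rates (yr^-1)
I = [1.0, 0.0, 0.0]     # litter input to each pool (per year)
X = steady_state_pools(I, f, k)
```

Because the steady state is obtained from a single linear solve rather than a long spin-up simulation, it can be evaluated cheaply for every proposed parameter set.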

For the vertically-resolved model in CLM 
4.5bgc, a matrix equation can be used to repre- 
sent soil C dynamics among three pools within 
each soil layer over a vertical profile of ten soil 
layers (totaling 30 pools, Figure 31.1b). The 
three pools within each soil layer are labile, slow 
and passive soil C. There are in total ten soil lay- 
ers with diffusivity among the layers. The matrix 
equation is: 


$$\frac{dX(t)}{dt} = I + A\,\xi(t)\,K\,X(t) - V(t)\,X(t) \tag{31.2}$$


Here, $I$, $A$, $\xi$, $K$ and $X$ correspond to the same model
components as in the conventional model, above, but differ in
dimensionality to account for the presence of explicit soil layers. $I$ is
the litter input to the soil C pools,
$I = [i_{1,1}, i_{2,1}, \ldots, i_{m,n}, \ldots, i_{3,10}]$, where m is the
soil C pool, ranging from 1 to 3, and n is the soil layer, ranging from 1
to 10.

$$A = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_{10} \end{pmatrix}$$

is a block diagonal transfer matrix with dimension 30 by 30 (three carbon
pools per soil layer for ten layers), and

$$A_L = \begin{pmatrix} -1 & f_{12} & f_{13} \\ f_{21} & -1 & 0 \\ f_{31} & f_{32} & -1 \end{pmatrix}$$

is a block matrix with L being the soil layer, taking a value from 1 to
10. The dimension of $A_L$ is 3 × 3 with elements $f_{ij}$, in which i is a
receiving pool and j is a donating pool, and the blank entries in the
matrix $A$ are zeros. $\xi$ is the environmental modifier, a product of
temperature scalar, water scalar, depth scalar, and oxygen scalar as in
Koven et al. (2013); $K$ is the baseline turnover rates for the soil C
pools; $X$ is the C concentration for the 30 pools,
$X = [x_{1,1}, x_{2,1}, \ldots, x_{m,n}, \ldots, x_{3,10}]$, where m is the
soil C pool, ranging from 1 to 3, and n is the soil layer, ranging from 1
to 10;


$$V = \begin{pmatrix} V_{1,1} & V_{1,2} & & & \\ V_{2,1} & V_{2,2} & V_{2,3} & & \\ & \ddots & V_{m,m} & \ddots & \\ & & V_{9,8} & V_{9,9} & V_{9,10} \\ & & & V_{10,9} & V_{10,10} \end{pmatrix}_{30 \times 30}$$

is a block tridiagonal matrix to represent C diffusivity between soil
layers. Each block is a 3 × 3 diagonal matrix, with one entry for each of
the three C pools. $V_{m,m}$ is the fraction of a given C pool at a given
soil layer being transferred into upper and lower soil layers;
$V_{m-1,m}$ is the received fraction of carbon transferred from lower soil
layer m to layer m-1; $V_{m+1,m}$ is the received fraction of carbon
transferred from the upper soil layer, m, to layer m+1. Also,
$V_{m,m} = V_{m-1,m} + V_{m+1,m}$ for a given m. $V$ can be approximated
using the model parameter diffusivity ($D_1$ for non-permafrost
diffusivity and $D_2$ for permafrost diffusivity). Note that compared to
the environmental modifier in the conventional model, there is an
additional scalar, the depth scalar ($r_z$), which is computed with

$$r_z = \exp\left(-\frac{z}{z_\tau}\right)$$

where $z_\tau$ is the e-folding depth. To be consistent with the other two
models, we only calculate the temperature scalar, using the equation
$Q_{10}^{(T_{soil} - 25)/10}$, where $Q_{10}$ is the temperature sensitivity
of decomposition and $T_{soil}$ is soil temperature. The rest of the
environmental scalars are direct outputs from running CLM 4.5. Therefore,
there are 13 global parameters (f's, k's, $a_1$, $a_2$, $Q_{10}$, $D_1$,
$D_2$, $z_\tau$) in the vertically-resolved model.
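The two scalars that depend on model parameters can be written as small functions. This is a minimal sketch, assuming a reference temperature of 25 °C in the Q10 expression and depth measured in the same units as the e-folding depth; the function and argument names are hypothetical.

```python
import math

def temperature_scalar(t_soil, q10, t_ref=25.0):
    """Q10 temperature scalar; t_ref = 25 degrees C is an assumption here."""
    return q10 ** ((t_soil - t_ref) / 10.0)

def depth_scalar(z, z_tau):
    """Depth scalar exp(-z / z_tau), where z_tau is the e-folding depth."""
    return math.exp(-z / z_tau)

# A 10-degree warming above the reference multiplies the rate by Q10,
# and decomposition at depth z = z_tau is reduced by a factor of e.
warm = temperature_scalar(35.0, q10=2.0)
deep = depth_scalar(0.5, z_tau=0.5)
```

A Q10 close to 1, as found for the posterior means reported later in the chapter, makes the temperature scalar nearly flat across soil temperatures.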

The Microbial-Mineral Carbon Stabilization (MIMICS) model was described by
Wieder et al. (2015). There are two soil microbial C pools ($MIC_r$ and
$MIC_K$) and three soil C pools: available soil C ($SOM_a$),
physically-protected C ($SOM_p$) and chemically recalcitrant C ($SOM_c$)
(Figure 31.1c). Michaelis-Menten equations are adopted to describe soil C
uptake by soil microbes. The dynamics of the soil C can be represented by
the following equations:

$$\frac{dSOM_p}{dt} = R_{l\text{-}p} + R_{mic\text{-}p} - SOM_p \times D \tag{31.3}$$

$$\frac{dSOM_c}{dt} = R_{l\text{-}c} + R_{mic\text{-}c} - U_{c\text{-}r} - U_{c\text{-}K} \tag{31.4}$$

$$\frac{dSOM_a}{dt} = R_{mic\text{-}a} + U_{c\text{-}r} + U_{c\text{-}K} + SOM_p \times D - U_{a\text{-}r} - U_{a\text{-}K} \tag{31.5}$$

where $R_l$ is the input to soil C from litter
($R_{l\text{-}p} = f_p \times$ total_input and
$R_{l\text{-}c} = f_c \times$ total_input) and $R_{mic}$ is the input to
soil C from microbial decay, $D$ is the turnover rate for $SOM_p$,
$U_{c\text{-}K}$ is the uptake of $SOM_c$ by K-selected microbes,
$U_{c\text{-}r}$ is the uptake of $SOM_c$ by r-selected microbes,
$U_{a\text{-}K}$ is the uptake of $SOM_a$ by K-selected microbes and
$U_{a\text{-}r}$ is the uptake of $SOM_a$ by r-selected microbes. The
uptake process takes the form of a Michaelis-Menten equation:

$$U = MIC \times V_{max} \times \frac{SOM}{K_O \times K_m + SOM}$$

where $MIC$ is the microbial biomass, $V_{max}$ is the maximum reaction
rate, $SOM$ is the soil C content, $K_O$ is the modifier for oxidation of
SOM, and $K_m$ is the half saturation constant. For more model details,
see Wieder et al. (2015).
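The Michaelis-Menten uptake term can be sketched as a single function. This is a minimal illustration of the functional form described above, not the MIMICS implementation; the function name, argument names and numbers are hypothetical.

```python
def uptake(mic, v_max, som, k_m, k_o=1.0):
    """Michaelis-Menten uptake of a substrate pool by a microbial pool.

    mic   : microbial biomass
    v_max : maximum reaction rate
    som   : soil C content of the substrate pool
    k_m   : half saturation constant
    k_o   : modifier for oxidation of SOM (1.0 means no modification)
    """
    return mic * v_max * som / (k_o * k_m + som)

# When the substrate equals the half saturation constant (and k_o = 1),
# uptake is half of its maximum, mic * v_max / 2.
half_max = uptake(mic=2.0, v_max=3.0, som=10.0, k_m=10.0)
```

The saturating form is what makes the microbial model nonlinear in its state variables, in contrast to the first-order decay of the two matrix models.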

Our implementation is based on the original 
MIMICS model, with slight modifications for sim- 
plification. In total, there are 22 global parameters. 
Since the range for most of the parameters is not 
well characterized in the literature, we prescribe 
the minimum of each parameter as the default 
value divided by three and the maximum as the 
default value multiplied by three. In addition to 
the soil C dynamics, the two microbial C pools are 
represented by the equations: 


$$\frac{dMIC_r}{dt} = R_{l\text{-}r} + U_{a\text{-}r} \times MGE1 - MIC_r \times \tau_r \tag{31.6}$$

$$\frac{dMIC_K}{dt} = R_{l\text{-}K} + U_{a\text{-}K} \times MGE1 - MIC_K \times \tau_K \tag{31.7}$$

where $R_{l\text{-}r}$ and $R_{l\text{-}K}$ are the input to r- and
K-selected soil microbes, respectively.

$$R_{l\text{-}r} = \frac{U_{m\text{-}r} + U_{s\text{-}r}}{U_{m\text{-}r} + U_{m\text{-}K} + U_{s\text{-}r} + U_{s\text{-}K}} \times (\mathrm{total\_input} - R_{l\text{-}p} - R_{l\text{-}c}) \tag{31.8}$$

$$R_{l\text{-}K} = \frac{U_{m\text{-}K} + U_{s\text{-}K}}{U_{m\text{-}r} + U_{m\text{-}K} + U_{s\text{-}r} + U_{s\text{-}K}} \times (\mathrm{total\_input} - R_{l\text{-}p} - R_{l\text{-}c}) \tag{31.9}$$

$U_{m\text{-}r}$ and $U_{m\text{-}K}$ are the uptakes of metabolic litter
by r- and K-selected microbes, respectively; and $U_{s\text{-}r}$ and
$U_{s\text{-}K}$ are the uptakes of structural litter by r- and K-selected
microbes, respectively; all the U's are calculated with default parameters
in MIMICS and litter from CLM 4.5 with the sole purpose of normalizing the
input to r- and K-selected microbes. $U_{a\text{-}r}$ and $U_{a\text{-}K}$
are the uptake of available soil C by r- and K-selected microbes,
respectively. $MGE1$ is the microbial growth efficiency for uptake of
$SOM_a$. $\tau_r$ and $\tau_K$ are the turnover rates of r- and K-selected
microbes, respectively.

Carbon use efficiency or microbial growth effi- 
ciency (MGE) is a key parameter in microbial mod- 
els. However, we do not consider it as a parameter 
in our study since we are using microbial biomass 
data as an input to the MIMICS model. 


DATASETS AND DATA-MODEL FUSION 


The observations we will use are re-gridded top- 
soil organic carbon (0-30 cm) and subsoil organic 
carbon (30-100 cm) from the Harmonized World 
Soil Database (HWSD; https://daac.ornl.gov/ 
SOILS/guides/HWSD.html). The native resolution 
(30 arc sec) in HWSD is re-gridded to match the 
default grid used by the CLM 4.5 model. 

Due to the possible under-estimation of per- 
mafrost soil C in HWSD, we replace it with the 
Northern Circumpolar Soil Carbon Database 
(NCSCD; http://bolin.su.se/data/ncscd/netcdf.php) in permafrost
areas. The NCSCD was developed
to quantify the Northern Circumpolar per-
mafrost soil C stocks down to three meters. There 
are four soil layers in this database, 0-30 cm, 
0-100 cm, 100-200 cm and 200-300 cm. We 
regridded the NCSCD from 1-degree resolution to 
that of the CLM 4.5 grid. 

The microbial biomass C is prescribed as 
the steady state in the MIMICS model, based 
on the Global Microbial Biomass dataset 
(https://daac.ornl.gov/SOILS/guides/Global_Microbial_Biomass_C_N_P.html). The steady state
assumption is justified, given the fast turnover of 
microbes, typically less than one year, which leads 
to fast equilibration of the microbial biomass 
pool. A global parameter, f, (fraction of r-selected 
microbial biomass) is multiplied by total micro- 
bial biomass from the input database to calculate 
the r-selected microbial biomass. The remainder 
is the K-selected microbial biomass. We regrid 
the original 0.5° resolution data to the CLM grid, 
matching the other datasets above. 
To fit the observations, we use Bayes' theorem to estimate parameter
values and associated uncertainties:

$$p(\theta \mid Z) = \frac{p(Z \mid \theta) \times p(\theta)}{p(Z)} \tag{31.10}$$

where $p(\theta \mid Z)$ is the posterior distribution of the parameters
$\theta$ given the observations $Z$. $p(Z \mid \theta)$ is the likelihood
function for a parameter set, calculated with the assumption that each
parameter is independent from all other parameters and has a log-normal
distribution, with the difference between model and data having zero mean:

$$p(Z \mid \theta) \propto \exp\left\{ -\sum_{i} \frac{\left[ Z_i - \varphi X_i \right]^2}{2\sigma_i^2} \right\} \tag{31.11}$$

Here $Z_i$ is the logarithm of the ith soil C observation in the
observational database, $X$ are the logarithms of the carbon pools from
the model, and $\varphi$ is the mapping vector that maps the simulated
carbon pools to observations. $X$ is derived by assuming the current soil
status is at steady state. We will be conservative in assigning errors to
the soil C, with $\sigma_i = 0.5\,Z_i$. For the conventional model, we
assimilate the data by aggregating all the soil layers together, which is
0-100 cm in non-permafrost regions and 0-300 cm in permafrost regions;
for CLM 4.5, we assimilate data for 0-100 cm in HWSD for non-permafrost
soils, and 0-100, 100-200 and 200-300 cm in NCSCD for permafrost soils,
independently. In contrast to soils elsewhere, permafrost regions contain
a huge amount of carbon stock in deeper soil, which justifies the use of
deeper soil carbon data for these regions. For MIMICS, we assimilate the
soil C data down to 100 cm only, due to the explicit 1 m depth
parameterization of this model.
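The likelihood can be evaluated in log space, which avoids numerical underflow when many grid cells are summed. The sketch below is a minimal illustration assuming a Gaussian form on log-transformed soil C with a factor of 2 in the denominator, and assuming the simulated pools have already been mapped onto the observed depth intervals; the function name and the numbers are hypothetical.

```python
import math
import numpy as np

def log_likelihood(obs_c, sim_c, rel_error=0.5):
    """Log of the likelihood for log-transformed soil C.

    obs_c, sim_c : observed and simulated soil C on matching grid cells
    rel_error    : sigma_i = rel_error * Z_i, with Z_i the log observation
                   (undefined where Z_i = 0, i.e. where obs_c equals 1)
    """
    z = np.log(np.asarray(obs_c, dtype=float))
    x = np.log(np.asarray(sim_c, dtype=float))
    sigma = rel_error * z
    return -np.sum((z - x) ** 2 / (2.0 * sigma ** 2))

perfect = log_likelihood([10.0, 20.0], [10.0, 20.0])
biased = log_likelihood([10.0], [20.0])
```

Only differences in this quantity between two parameter sets matter for the sampler, so the proportionality constant in the likelihood never needs to be computed.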

We will assume that the parameters are distributed uniformly within their
prior ranges. Since the range for most of the parameters in MIMICS is
unknown, we assume the range of the distribution to be
$[\theta_0/3,\ 3\theta_0]$, where $\theta_0$ is the default value.

To obtain posterior probability distributions 
of parameters we will employ the Metropolis- 
Hastings (M-H) algorithm, which is a Markov 
Chain Monte Carlo (MCMC) technique (Hastings, 
1970). A detailed description of the M-H algo- 
rithm can be found in Xu et al. (2006). 

In brief, the M-H algorithm consists of iterations of two steps: a
proposing step and a moving step. In the proposing step, a new parameter
set $\theta^{new}$ is proposed based on the previously accepted parameter
set $\theta^{old}$ and a proposal distribution, which is uniform in our
study:

$$\theta^{new} = \theta^{old} + r \times (\theta_{max} - \theta_{min})/D \tag{31.12}$$

where $\theta_{max}$ and $\theta_{min}$ are the maximum and minimum values
of parameters, r is a random variable between -0.5 and 0.5, and D is used
to control the proposing step size and is set to 5 as in Xu et al. (2006).
In each moving step, $\theta^{new}$ is tested against the Metropolis
criterion to examine if the new parameter set should be accepted or
rejected. We treat the first 2500 accepted samples as a burn-in period,
discarding those samples, then use the rest to generate posterior
parameter distributions. In total, there are 50000 accepted samples to
construct the posterior distribution.
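The proposing and moving steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: it follows the bookkeeping described above by storing accepted samples only (a textbook Metropolis chain would also repeat the current state after a rejection), it rejects proposals outside the uniform prior range, and the toy likelihood, function names and sample counts are hypothetical.

```python
import numpy as np

def metropolis_hastings(log_like, theta_min, theta_max,
                        n_accept=5000, burn_in=500, D=5.0, seed=0):
    """Collect accepted samples with a uniform random-walk proposal."""
    rng = np.random.default_rng(seed)
    theta_min = np.asarray(theta_min, dtype=float)
    theta_max = np.asarray(theta_max, dtype=float)
    theta_old = rng.uniform(theta_min, theta_max)   # start inside the prior
    ll_old = log_like(theta_old)
    accepted = []
    while len(accepted) < n_accept:
        # Proposing step: theta_new = theta_old + r * (max - min) / D
        r = rng.uniform(-0.5, 0.5, size=theta_old.shape)
        theta_new = theta_old + r * (theta_max - theta_min) / D
        if np.any(theta_new < theta_min) or np.any(theta_new > theta_max):
            continue                                # zero prior probability
        # Moving step: Metropolis criterion on the log-likelihood ratio
        ll_new = log_like(theta_new)
        if np.log(rng.uniform()) < ll_new - ll_old:
            theta_old, ll_old = theta_new, ll_new
            accepted.append(theta_new)
    return np.array(accepted[burn_in:])             # discard burn-in

def toy_log_like(theta):
    """Hypothetical target: Gaussian centred on 2.0 with sd 0.3."""
    return -np.sum((theta - 2.0) ** 2) / (2.0 * 0.3 ** 2)

samples = metropolis_hastings(toy_log_like, [0.0], [5.0])
```

A larger D gives smaller steps and a higher acceptance rate but slower exploration of the prior range, which is the trade-off that the step-size parameter controls.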


POSTERIOR DISTRIBUTION OF MODEL 
PARAMETERS 


Outcomes of the probability inversion are shown in Figure 31.2. A narrow
posterior distribution indicates that a parameter is well constrained. We
can see that the inversion was effective in terms of constraining the
targeted parameters in the three soil C decomposition models.
Specifically, coefficients (i.e., $t_1$ and $t_2$) for calculating
$f_{21}$ (fraction of C in fast soil C transferring to slow soil C), decay
rate of passive soil C ($k_3$) and temperature sensitivity ($Q_{10}$) were
well constrained in the conventional Century-type model (Figure 31.2a);
observed soil C vertical profiles further helped constrain the decay rate
of slow soil C ($k_2$) in the vertically-resolved model, and particularly
the vertical diffusivity parameters ($D_1$ and $D_2$) and e-folding depth
($z_\tau$) (Figure 31.2b). Interestingly, the posterior mean of $D_2$
(permafrost) was found to be larger than that of $D_1$ (non-permafrost),
as diffusivity in permafrost soil was found to be faster than in
non-permafrost soil, which is mainly due to higher cryoturbation. The
$Q_{10}$ mean is 1.25 in the conventional model and 1.06 in the
vertically-resolved model, both of which are less than the default value
(2), but close to empirical values. The transfer coefficients (f) were not
well constrained in either of the two models.

In MIMICS, parameters related to uptake rate (the V coefficients) and
desorption rate of physically-protected soil C (the D coefficients), and
the proportions of litter input and of the two microbial C pools (the f
parameters) were well constrained (Figure 31.2c). None of the modifiers
for calculating uptake rate and half saturation constant were well
constrained. Additional datasets are especially needed to tease apart
multiple processes and further reduce the uncertainty. Results from this
study highlight that data constraints may limit the ability of data
assimilation to reduce uncertainty in more complicated model structures.
Isotopic data in C processes show great potential to constrain these
processes. Indeed, $^{14}$C soil profiles have been used to constrain
transfer coefficients and turnover rates at multiple sites; isotopic
labeling to trace C pathways is another powerful tool to provide
additional constraints to relevant processes, such as distinguishing root
respiration from total soil respiration, sources of input, or proportion
of different soil C pools. Other additional data, such as soil respiration
and soil C incubation datasets, are equally valuable constraints. We
therefore advocate using isotopic data and other datasets as complementary
sources to better constrain model parameters and hence projections.

Figure 31.2. Violin plot of the accepted model parameter values in the three decomposition models. (a) Conventional model;
(b) Vertically-resolved model; (c) Microbial model (MIMICS). The narrower the distribution of a parameter, the better it is
constrained. Note that the values on the y-axis are normalized to the range of [0, 1] by the corresponding values in the brackets
under each parameter. In other words, the real values can be derived by multiplying by the values in the brackets.

Figure 31.3. Changes in global total soil carbon under RCP 8.5. 1000 parameter sets were randomly sampled from the
posterior distribution of parameters to generate the temporal trajectories in each model. The shaded area represents the 95%
confidence interval generated by 1000 parameter sets. The time step is shown in years, from 2005 to 2100. CON = conventional
Century-type model; CLM4.5 = vertically-resolved model.


UNCERTAINTIES IN SOIL CARBON 
PROJECTIONS UNDER RCP 8.5 


To illustrate the impact of parameter uncertainty 
on long-term soil C projection, we performed for- 
ward runs over the 21st century with 1000 sets 
of parameter values drawn from the posterior 
distribution for each model. The soil C input and 
environmental modifiers (except the temperature 
scalar) used to drive the models were derived from 
running original CLM 4.5 under RCP 8.5 scenario, 
with climate forcing from the Community Earth 
System Model (CESM) for 2005-2100. We output
monthly litter production from the CLM 4.5 run 
and use it to provide the soil C input to drive our 
soil C models. 

We randomly sampled 1000 parameter sets out of the accepted posterior
values. With each of the sampled parameter sets, we forced the
vertically-resolved model with C input and environmental modifiers
obtained from CLM 4.5 under RCP 8.5 from 2005 to 2100. For the
conventional and microbial models, we derived total inputs and mean
environmental modifiers of all the ten soil layers to force the two
models. For the two non-microbial models, we used a monthly time step.
For the microbial model, a daily time step was used, because its nonlinear
dynamics become numerically unstable at longer time steps.
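The ensemble procedure (draw parameter sets from the posterior, run each forward, summarize the spread) can be sketched with a toy model. This is a minimal illustration only: the one-pool model dX/dt = I - kX, the stand-in posterior sample, the input and the initial pool size are all hypothetical and are not the models or values used in this chapter.

```python
import numpy as np

def project(k, x0, inputs, dt):
    """Forward Euler for dX/dt = I(t) - k X, vectorized over parameter sets."""
    x = np.full_like(k, x0, dtype=float)
    for i_t in inputs:
        x = x + dt * (i_t - k * x)
    return x

rng = np.random.default_rng(1)
# Hypothetical stand-in for the accepted posterior samples of one parameter.
posterior_k = rng.uniform(0.01, 0.03, size=50000)
draws = rng.choice(posterior_k, size=1000, replace=False)

n_months = 12 * 96                 # monthly steps for 2005 to 2100 inclusive
inputs = np.full(n_months, 1.0)    # constant C input per year (hypothetical)
x0 = 50.0                          # initial pool size (hypothetical)

change = project(draws, x0, inputs, dt=1.0 / 12.0) - x0
lower, upper = np.percentile(change, [2.5, 97.5])
```

The 2.5th and 97.5th percentiles of the ensemble give the 95% interval of the projected change; an interval that straddles zero means the sign of the response depends on the parameter set.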

The three models projected substantially different changes and
trajectories in global total soil C over the 21st century (Figure 31.3).
The conventional model projected consistent soil C loss with the least
uncertainty (95% confidence interval: -71 to -17 Pg C). Adding vertical
resolution or microbial dynamics to the conventional model increased the
projection uncertainty (95% confidence interval: -222 to 583 Pg C and
-397 to 144 Pg C, respectively) and made the sign of the soil C-climate
feedback dependent on the parameters. The uncertainties in the
vertically-resolved model and MIMICS are more than ten times larger than
that in the conventional model. This interesting result shows that using
more parameters and more explicit dynamics may lead to a larger
prediction uncertainty due to feedbacks in the model dynamics.


SENSITIVITY TO INITIAL CONDITIONS AND
MODEL PARAMETERS


Our results show that the initial conditions ($S_i$) were tightly
correlated with the projected soil C in the two nonmicrobial models at
global scale, but did not correlate well with the changes in soil C in
any of the models (Figure 31.4). The microbial model's initial conditions
were not strongly correlated with projected soil C at the global scale.
The findings suggest that, in general, uncertainty in the initial
conditions propagates through the simulation to the projection of future
soil C, and this propagation is especially evident in the two
nonmicrobial models.

Besides initial conditions, model parameters are also able to affect
predicted soil C or C changes directly, or indirectly through influencing
initial conditions following a spin-up (Figure 31.4). In the conventional
model, predicted soil C did not significantly correlate with any
individual model parameter; changes in soil C were positively associated
with $k_2$ (decay rate of slow soil C), but negatively with $k_3$
(turnover rate of passive soil C) and $Q_{10}$ (temperature sensitivity of
soil C turnover). In the vertically-resolved model, predicted soil C was
positively associated with $D_1$ (diffusivity in non-permafrost soils) and
with one of the turnover rates, but negatively with another (Figure 31.4);
like predicted soil C, changes in soil C were positively associated with
the same two parameters and negatively with the third. In MIMICS,
predicted soil C change was weakly associated with a V coefficient
(regression coefficient for calculating maximum reaction rate) and a D
coefficient (coefficient for calculating desorption rate from physically
protected soil C to available soil C); projected soil C content had no
significant linear relationship with any of the model parameters.

Figure 31.4. The relationships between soil C, model parameters and initial conditions. The linear correlations between soil C or
C changes, and initial conditions (Si) and model parameters under RCP 8.5 in the three models (see Figure for key). Blue circles
represent positive correlations and red circles represent negative correlations. The size of a circle corresponds to the correlation
coefficient (legend) between model parameters or initial conditions (horizontal axis) and soil C or C changes (vertical axis).

In summary, this chapter demonstrates the importance of model structure
and parameterization in determining the predicted soil C response to
climate change. To increase confidence in soil C projection, diverse
model structures are necessary. The vertically-resolved CLM4.5 model and
the microbial MIMICS model showed much greater uncertainty in projected
soil C under RCP 8.5, while the conventional model consistently predicted
strong positive C-climate feedback. CLM 4.5 and MIMICS outperformed the
conventional model in terms of estimating the spatial distribution of
soil C. However, the larger uncertainty in the projections of soil C by
the latter two models also suggests that we need to strike a balance when
adding process detail, as increased model complexity tends to amplify
uncertainty.


SUGGESTED READING 


Shi, Z., Crowell, S., Luo, Y., & Moore, B. (2018). 
Model structures amplify uncertainty in predicted 
soil carbon responses to climate change. Nature 
Communications, 9(1): 1-11. 


QUIZZES 


1. What are the major differences between 
conventional models and microbial models? 


2. Describe the matrix equations for the con- 
ventional model and depth-resolved model. 


3. What are the dominant parameters for mod- 
eled soil carbon and soil carbon changes? 


4. How might we further constrain model pre- 
dictions in more complex models? 
Land carbon models are useful tools to predict the
impacts of global change on terrestrial ecosystems. 
However, uncertainties in the model predictions 
are large, and need to be reduced to increase our 
confidence in model predictions. Both carbon 
pools and carbon fluxes are usually measured in 
land carbon studies. This practice helps you under- 
stand the value of pool and flux measurements to 
constrain a land carbon model through data assim- 
ilation. To this end, four exercises are designed, 
which are model run without data assimilation 
and model runs with assimilation of carbon pool 
and/or carbon flux measurements. Through these 
exercises, you will learn how to select appropriate 
measurements to constrain model predictions of 
land carbon dynamics. 


INTRODUCTION 


Land carbon models are useful tools to investi- 
gate global change impacts and explore options 
to mitigate climate change. However, uncertainties 
in model predictions of land carbon dynamics are 
large. There is an urgent need to reduce the uncer- 
tainties and increase our confidence in model 
predictions. Model uncertainties have three major 
sources, which are model structure, model param- 
eterization, and external forcing (see Chapter 21). 


DOI: 10.1201/9780429155659-40 


Data assimilation directly addresses uncertainties 
stemming from model parameterization by opti- 
mizing model parameters for a good model-data 
fit. It also helps shed light on uncertainty stem- 
ming from model structure. 

Both carbon pools and carbon fluxes are usu- 
ally measured in land carbon studies. This practice 
investigates the relative benefits of assimilating 
pool and flux measurements into a model in terms 
of improving our predictive understanding of 
land carbon dynamics. The practice encompasses 
four exercises: model runs without data assimi- 
lation; with assimilation of pool measurements; 
with assimilation of flux measurements; and with 
assimilation of both pool and flux measurements. 
Pool measurements include sizes of leaf, wood, 
and root biomass carbon. Flux measurements 
include net ecosystem exchange (NEE), gross pri- 
mary production (GPP), and ecosystem respira- 
tion (Reco). 

The model runs are performed in the train- 
ing software CarboTrain by selecting unit 8 and 
Exercises 1-5. Note that a full data assimilation 
run in this practice will take about eight hours to 
execute, depending on the specifications of your 
personal computer. Therefore, Exercises 1-4 are 
based on prepared model outputs. Only Exercise 5 
involves a full data assimilation run. 
EXERCISE 1: Model run without data
assimilation


A model run without data assimilation is use- 
ful as a starting point. It can give information 
of the model and the data assimilation system, 
and acts as a baseline to compare to model runs 
with assimilation of pool and/or flux measure- 
ments. As described in Chapter 29, the struc- 
ture and prior parameter uncertainty of a model 
constitute the prior knowledge about a system. 
Prior parameter uncertainty is characterized by 
the range and prior probability of a parameter. 
The range of possible values for a parameter is 
usually defined by knowledge from the litera- 
ture or observations. The prior probability of a 
parameter is often assumed to follow a uniform 
distribution between a minimum and maxi- 
mum bound, but can also have another distri- 
bution (e.g., Gaussian) within the range of the 
parameter. In addition to the prior range and 
distribution, the choice of cost function and 
the method to generate new sets of parameters 
at each step of simulation can affect the pos- 
terior distributions of model parameters, and 
predictions based on these. 

To do this exercise, select unit 8 and Exercise
1 in the main window of CarboTrain. Select
the output directory, and click the Run Exercise
button. Model outputs will then appear in the
DA_with_no_obs.zip folder in the output
directory you set.

Before looking at the model outputs, let's get familiar with the model
files and workflow of the data assimilation system. You can explore the
data assimilation system by looking into the Source_code/TECO_2.3
folder. For this exercise, you use the name-list file
workshop_nml/teco_workshop_da_no_obs.txt. Settings in this file are
passed to the script of TECO v2.3 (open the file
Source_code/TECO_2.3/TECO2.3.f90 with a text editor such as Notepad++ to
view the Fortran source code of TECO v2.3). Parameter default values and
ranges may be found in the Source_code/TECO_2.3/input/ folder. The TECO
model will use parameter values in SPRUCE_pars.txt as the initial
parameter values. Minimum and maximum values of parameters to build the
prior parameter uncertainties are sourced from SPRUCE_da_pars.txt. The
latter file also specifies whether a parameter is fixed or not. If the
value under a parameter is 0, it means the parameter is fixed and will
not change during the model run. If the value under a parameter is 1, it
means the parameter value will be sampled from its prior distribution
during the model run. Since no measurements are selected for the current
exercise, all prior parameter values created at the parameter generation
step (Lines 1065-1082 in TECO2.3.f90) will be accepted as the posterior
parameter values (Lines 8083-8086).

Change the directory back to DA_with_no_obs.zip (may need to be unzipped
before opening) to examine the model files for this exercise. You can
find model input files for this exercise in DA_with_no_obs/input, model
output files in DA_with_no_obs/output, an R script used for visualizing
model outputs in DA_with_no_obs/R_code_for_DA_Unit_8.R, and plots
showing model outputs in DA_with_no_obs/plot/DAUnit8.

Subfolders under DA_with_no_obs/output receive model outputs from forward
runs (subfolder SPRUCE), model outputs without data assimilation
(subfolder DA_nomeasure), model outputs from assimilation of carbon pool
measurements (subfolder DA_cpool), model outputs from assimilation of
carbon flux measurements (subfolder DA_cflux), and model outputs from
assimilation of both carbon pool and carbon flux measurements (subfolder
DA_cpool_cflux).

In this exercise we wish to examine outputs from a model run with default
parameter values and ranges. If you open the folder DA_nomeasure, you
will find the Paraest.txt file, which contains the posterior parameter
values. The folder also contains files such as Simu_dailyflux001 and
Simu_dailyflux50, which contain forward simulations of land carbon
dynamics using randomly selected sets of posterior parameter values.

Figure 32.1. Results after model run without data assimilation. (A) Posterior distributions of parameter values. (B) Information gain. (C) Model simulations. Full name and units
of parameters can be seen in “Source_code/TECO_2.3/annotations in TECO model/parameter_file_annotations.jpg”. In (C), points and error bars
(not clearly visible as range is smaller than the point size) indicate the mean and estimated standard deviation of measurements, respectively. Black line, deep gray band, and light
gray band indicate the mean, first to third quartiles, and range of 50 simulations, respectively. RMSE indicates root mean square error.

Posterior distributions of model parameters, and predictions based on
these, are available in the folder DA_with_no_obs/plot/DAUnit8. The file
“1 Density no measurement.png” corresponds to Figure 32.1A and B, while
“1 Simulation no measurement.png” corresponds to Figure 32.1C. For each
parameter, posterior values tend to cluster near the mean value rather
than distribute uniformly within the parameter range (Figure 32.1A). The
parameter values cluster together because of the location optimization by
random walk in the data assimilation system (Lines 8228-8243 in
TECO2.3.f90). The location optimization technique generates new parameter
values near the parameter values of the previous simulation by narrowing
parameter spaces with a distance criterion. The purpose is to accelerate
the progress of optimization. Similarly, model predictions of land carbon
dynamics tend to cluster near the means rather than distribute uniformly
(Figure 32.1C).

As discussed in Chapter 29, the relative 
information content of a model on predictions 
and parameters can be quantified and com- 
pared before and after data assimilation using 
the Shannon information index. In brief, the 
entropy of null knowledge on parameters (or 
predictions), Hy, can be calculated as: 


Ho = log, n (32.1) 


where $n$ is the number of bins of posterior
parameter values. A bin is a subgroup of values
within the posterior parameter range. Here we
set $n$ to 200, meaning we divide the parameter
range into 200 equally spaced intervals between
the minimum and maximum bounds of the posterior
parameter distribution. We could choose another
arbitrary value for $n$. Since the base of the
logarithm is 2, the unit of $H$ is the bit. The
entropy of the probability distribution function
of parameters (or predictions) generated by a
model alone (or a model with assimilation of
data), $H(X_m)$, is calculated as:


$$H(X_m) = -\sum_{i=1}^{n} p(x_{m,i}) \log_2 p(x_{m,i}) \tag{32.2}$$


where $X_m$ is a parameter (or a carbon pool
or flux) simulated by the model alone, and
$x_{m,i}$ is the mean value of $X_m$ in the
$i$th bin. The relative information of the
model, $I_m$, is calculated as:


$$I_m = H_0 - H(X_m) \tag{32.3}$$


Similarly, the entropy of the PDF of a parameter
(or a carbon pool or flux) simulated by a model
following assimilation of datasets, $H(X_{md})$,
is calculated as:


$$H(X_{md}) = -\sum_{i=1}^{n} p(x_{md,i}) \log_2 p(x_{md,i}) \tag{32.4}$$


where $X_{md}$ is a parameter (or a carbon pool
or flux) simulated by the model after
assimilation of the datasets, and $x_{md,i}$ is
the mean value of $X_{md}$ in the $i$th bin.
The relative information of the dataset(s),
$I_d$, is calculated as:


$$I_d = H_0 - H(X_{md}) \tag{32.5}$$

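The entropy and relative information defined in Equations 32.1-32.5 can be computed from a sample of posterior values with a short script. The sketch below is an illustration in Python rather than the code used to produce Figure 32.1B; the function names and the optional `lower` and `upper` arguments are assumptions of this example. When two samples are compared (for example, before and after assimilation), passing the same `lower` and `upper` bounds to both keeps the bins identical.

```python
import math


def shannon_entropy(values, n_bins=200, lower=None, upper=None):
    """Shannon entropy in bits of `values` binned into equal-width intervals."""
    lo = min(values) if lower is None else lower
    hi = max(values) if upper is None else upper
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Map the value to a bin index, clamping to the first and last bins.
        idx = int((v - lo) / width) if width > 0 else 0
        counts[max(0, min(idx, n_bins - 1))] += 1
    total = len(values)
    # Empty bins contribute nothing to the sum.
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def relative_information(values, n_bins=200, lower=None, upper=None):
    """Relative information in bits: null entropy log2(n_bins) minus sample entropy."""
    return math.log2(n_bins) - shannon_entropy(values, n_bins, lower, upper)
```

A sample spread evenly over all 200 bins carries no information relative to the null (about 0 bits), whereas a sample that falls entirely in one bin reaches the maximum of $\log_2 200$, roughly 7.64 bits.
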
Model biases in simulating C pools and fluxes 
can be quantified using the Root Mean Square 
Error (RMSE), which is calculated as: 


$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} \left(\mathrm{Prediction}_i - \mathrm{Measurement}_i\right)^2}{N}} \tag{32.6}$$


where $N$ is the number of measurements,
$\mathrm{Prediction}_i$ is the $i$th predicted
value of a C pool or flux, and
$\mathrm{Measurement}_i$ is the $i$th measured
value of a C pool or flux. Model information
quantified using the above method is shown in
Figure 32.1B.
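
The RMSE of Equation 32.6 is straightforward to compute for paired predictions and measurements. The following Python function is a minimal sketch; the function name and the numbers in the usage example are made up for illustration and are not SPRUCE data.

```python
import math


def rmse(predictions, measurements):
    """Root mean square error between paired predictions and measurements."""
    if len(predictions) != len(measurements) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predictions, measurements)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical values of a carbon pool on three dates.
    print(rmse([110.0, 120.0, 135.0], [100.0, 125.0, 130.0]))
```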


QUESTIONS:


1. Will relative information change if you
change the logarithm base and/or the number
of bins?


2. Which indexes are used to indicate model
bias and uncertainty in simulations?


EXERCISE 2: Model run with assimilation
of carbon pool measurements


For this exercise, select unit 8 and Exercise
2 in CarboTrain. Select the output directory
and click Run Exercise. The data assimilation
system will read carbon pool measurements
from the file input/SPRUCE cpool.txt and
compare them against the simulated carbon
pools (Lines 7554-7597 in TECO_2.3.f90) in
order to decide, using the cost function,
whether a set of parameter values should be
accepted as posterior parameter values. You
can find model outputs in DA_with_cpools.zip.

You can find three plots in the folder
DA_with_cpools/plot/DAUnit8. These show
comparisons between the results of the model
when run without data assimilation (as in
Exercise 1) and with assimilation of carbon
pool measurements. The figures compare
posterior parameter values
(“2 Density C_pools.png”, Figure 32.2A and B)
and simulated carbon dynamics (mean values in
“2 Simulation C pools.png” and
“2 1 Simulation C pools uncertainty.png”,
Figure 32.2C). You can see that the posterior
distribution of specific leaf area is
bell-shaped (Figure 32.2A). Information on
specific leaf area is substantially higher
following assimilation of carbon pool
measurements (Figure 32.2B). Both results
suggest that specific leaf area is well
informed by the carbon pool measurements. The
turnover times of leaf and root are also
informed by the carbon pool measurements
(Figure 32.2A and B). Model parameters such as
the basal respiration rates of leaf, wood, and
root are not strongly informed by the carbon
pool measurements (Figure 32.2A and B). These
results indicate that carbon pool measurements
carry much more information on specific leaf
area and the turnover times of carbon pools
than on other parameters of the model.
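
The accept-or-reject decision that this exercise relies on can be sketched with a generic Metropolis rule. The Python code below is an assumption-laden illustration, not a transcription of TECO_2.3.f90: it assumes a cost function that sums squared mismatches scaled by the measurement variance, and the names `cost` and `accept` are hypothetical.

```python
import math
import random


def cost(simulated, observed, sigma):
    """Sum of squared model-data mismatches, each scaled by its variance."""
    return sum(
        (s - o) ** 2 / (2.0 * sd ** 2)
        for s, o, sd in zip(simulated, observed, sigma)
    )


def accept(cost_new, cost_old, rng=random):
    """Metropolis rule: always keep a better fit, sometimes keep a worse one."""
    if cost_new <= cost_old:
        return True
    # A worse fit is accepted with probability exp(cost_old - cost_new).
    return rng.random() < math.exp(cost_old - cost_new)


if __name__ == "__main__":
    # Hypothetical leaf carbon values: two candidate simulations, one dataset.
    observed, sigma = [50.0, 55.0], [5.0, 5.0]
    old = cost([70.0, 75.0], observed, sigma)
    new = cost([52.0, 56.0], observed, sigma)
    print(old, new, accept(new, old))
```

Parameter sets that pass this test form the posterior sample, so parameters that strongly affect the simulated carbon pools end up with narrow, bell-shaped distributions.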

After assimilating the carbon pool mea- 
surements, both model bias and uncertainty 
in simulating leaf carbon pool are markedly 
reduced (Figure 32.2C), suggesting that sim- 
ulated leaf carbon pool is well informed by 
carbon pool measurements. Model biases and 
uncertainties in simulating GPP, Reco, and NEE 
also tend to decrease (Figure 32.2C), probably 
because of constrained photosynthesis rate 
and autotrophic respiration which are closely 
related to leaf carbon pool. Surprisingly, model 
biases and uncertainties in simulating wood 
and root carbon pools are not reduced after 
assimilating the carbon pool measurements 
(Figure 32.2C). This is due to the relatively 
short simulations and data time series, rela- 
tive to the timescale of vegetation structural 
development; carbon pool measurements can 
potentially constrain model predictions of 
carbon pools but need to be high frequency 
and sufficiently long term. Since carbon fluxes 
are proportional to carbon pool sizes in the 
TECO v2.3 model, model simulations of car- 
bon fluxes are also informed by carbon pool 
measurements. 
Figure 32.2. Comparison of parameters and simulations before and after assimilation of carbon pool measurements. (A) Posterior distribution of parameter values. 
(B) Information gain. (C) Simulations after assimilating carbon pool measurements. Full names and units of parameters can be seen in “Source_code/TECO_2.3/
annotations in TECO model/parameter_file annotations.jpg”. In (C), point and error bar (smaller than the point size) indicate the mean and the 
estimated standard deviation of measurements, respectively. The point and error bar are shown in gray if they were not assimilated into the model and are shown in black 
if they were assimilated into the model. Solid line, deep gray band, and light gray band indicate the mean, first to third quartiles, and range of 50 simulations, respectively. 
Simulation uncertainty is shown only for GPP and foliage biomass C. RMSE indicates root mean square error. 


QUESTIONS:


1. Which measurement has more information
to constrain parameters related to leaf carbon
pool?


2. Why are parameters related to wood carbon
pool and root carbon pool not informed by
the carbon pool measurements used in this
study?


EXERCISE 3: Model run with assimilation of
carbon flux measurements


For Exercise 3, select unit 8 and Exercise 3
in CarboTrain. Select the output directory and
click Run Exercise. The data assimilation system
will read carbon flux measurements from the file
input/SPRUCE cflux.txt and compare them
against the simulated carbon fluxes
(Lines 7526-7549 in TECO_2.3.f90) in order to
decide, using the cost function, whether a set
of parameter values should be accepted as
posterior parameter values. You can find model
outputs in DA_with_cfluxes.zip. There are three
plots in the folder DA_with_cfluxes/plot/DAUnit8.
These show comparisons between the results of
the model when run without data assimilation
(as in Exercise 1), with assimilation of carbon
pool measurements (Exercise 2), and with
assimilation of carbon flux measurements.
The figures compare posterior parameter values
(“3 Density C pools.png”, Figure 32.3A and B)
and simulated carbon dynamics
(“3 Simulation C pools.png” and
“32 Simulation C pools uncertainty.png”,
Figure 32.3C).

After assimilating the carbon flux
measurements, several model parameters are well
constrained (Figure 32.3A). The informed
parameters are mainly related to photosynthesis,
plant respiration, and autotrophic respiration,
suggesting that carbon flux measurements
are best at constraining model predictions of
carbon fluxes. Moreover, parameters related
to fast-cycling carbon pools (e.g., leaf carbon
pool) are more informed by the carbon flux
measurements than parameters related to
slow-cycling carbon pools (e.g., stem carbon pool
and soil passive carbon pool) (Figure 32.3A
and B). This is expected because of the short
period (six years) of the carbon flux
measurements. With a longer period of
measurements, and a longer simulation, model
parameters related to slow-cycling carbon pools
may be informed.

After assimilating the carbon flux
measurements into the TECO model, model biases
and uncertainties in simulating NEE, GPP, and
Reco are largely reduced (Figure 32.3C),
suggesting that the carbon flux measurements
inform model simulations of carbon fluxes well.
However, model biases and uncertainties in
simulating carbon pools did not decrease after
assimilating the carbon flux measurements
(Figure 32.3C). Overall, we see that a
combination of parameter values adjusted
according to carbon flux measurements can result
in good simulations of carbon fluxes but carries
little information for simulating carbon pools.


QUESTIONS: 


1. Which model parameters are well informed 
by carbon flux measurements? 


2. Does yearly sum of carbon fluxes inform the 
model differently from daily sum of carbon 
fluxes? 
Figure 32.3. Comparison among model runs without data assimilation, and with separate assimilation of pool and flux measurements. (A) Posterior distribution of parameter values. 
(B) Information gain. (C) Simulations after assimilating carbon pool measurements. Full names and units of parameters can be seen in “Source_code/TECO_2.3/annotations 
in TECO model/parameter_file annotations.jpg”. In (C), black point and error bar (smaller than the point size) indicate the mean and the estimated standard 
deviation of measurements, respectively. Solid line, deep gray band, and light gray band indicate the mean, first to third quartiles, and range of 50 simulations, respectively. Simulation 
uncertainty is shown only for GPP and foliage biomass C. RMSE indicates root mean square error. 


EXERCISE 4: Model run with assimilation of 
carbon pool and flux measurements 


Based on the results of Exercises 2 and 3, 
you might expect the best constrained model 
parameters and predictions of carbon dynam- 
ics to be achieved by assimilating both pool 
and flux measurements into the model. We will 
explore whether that is the case in this exercise. 

Select unit 8 and Exercise 4 in CarboTrain.
Select the output directory and click Run Exercise.
The data assimilation system will read both
carbon pool and carbon flux measurements from
the folder input and compare them against the
simulated carbon pools and fluxes. You can
find model outputs in
DA_with_cpools_and_cfluxes.zip. There are
three plots in the folder
DA_with_cpools_and_cfluxes/plot/DAUnit8. These
show comparisons between the results of the
model when run without data assimilation (as
in Exercise 1), with assimilation of carbon pool
measurements (Exercise 2), with assimilation of
carbon flux measurements (Exercise 3), and with
assimilation of both pool and flux measurements.
The figures compare posterior parameter values
(“4 Density all.png”, Figure 32.4A and B) and
simulated carbon dynamics
(“4 Simulation all.png” and
“4 3 Simulation C pools and fluxes uncertainty.png”,
Figure 32.4C).

As expected, assimilating both pool and flux
measurements constrains model parameters
and predictions more than assimilating no
measurements, or pools or fluxes separately
(Figure 32.4A and B). Optimized parameter values
can differ between model runs when different
datasets are assimilated (Figure 32.4A). For
example, optimized values of specific leaf area
differ between data assimilation with carbon
pool measurements and data assimilation with
carbon flux measurements. Similarly, optimized
values of leaf turnover time differ depending
on which of these datasets is assimilated into
the model. These results suggest that we should
be cautious when interpreting optimized
parameter values obtained by data assimilation.

Several model parameters, for example,
maximum growth rates of leaf, stem, and root,
and turnover time of the soil passive C pool,
are not constrained by any available
measurements (Figure 32.4A and B). This result
suggests that additional measurements are needed
to constrain the model. For example,
measurements of stem growth in the field may be
collected to constrain the maximum growth rate
of stem. Soil carbon pool size and carbon age
indicated by the ¹⁴C signature may be needed to
constrain the turnover time of the soil passive
carbon pool, which is typically hundreds to
thousands of years. Besides collecting more
measurements, model simulations can be improved
by simplifying the representation of model
processes that cannot be constrained by
available measurements. For example, the litter
carbon pool is separated into a fine litter
carbon pool and a coarse litter carbon pool
in the TECO model. However, we do not have
any litter measurements to constrain the
turnover times of the two litter pools in this
case. We might consider combining the fine and
coarse litter pools to simplify the TECO model.

After assimilating carbon pool and flux
measurements into the TECO model, model
simulations of carbon dynamics are generally
less biased than after assimilating only carbon
pool or flux measurements (Figure 32.4C).
However, reductions of model biases by carbon
pool measurements and carbon flux measurements
are not additive. For example, the reduction
of model bias in simulating GPP with both
carbon flux and carbon pool measurements is
comparable to the reduction of model bias in
simulating GPP with carbon flux measurements
alone, although the carbon pool measurements
also carry some information on the simulation
of GPP (Figure 32.4C). A combination
of carbon pool and flux measurements helps
reduce model bias in simulating NEE more
than carbon pool measurements or carbon
flux measurements alone, but the reductions
are again not additive (Figure 32.4C). Finally,
as expected, assimilating carbon pool and flux
measurements into the model reduces model
uncertainties in simulating both carbon pools
and carbon fluxes (Figure 32.4C).
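
A toy calculation can make the non-additive behavior concrete. The sketch below is not taken from TECO or from the exercise outputs: it assumes a single parameter with a Gaussian prior and two independent Gaussian measurements of that same parameter, and it measures information gain as the reduction in Gaussian entropy, in bits. All names and numbers are hypothetical.

```python
import math


def posterior_variance(prior_var, *obs_vars):
    """Variance after combining a Gaussian prior with independent Gaussian data."""
    precision = 1.0 / prior_var + sum(1.0 / v for v in obs_vars)
    return 1.0 / precision


def information_gain_bits(prior_var, post_var):
    """Entropy reduction in bits when a Gaussian narrows from prior to posterior."""
    return 0.5 * math.log2(prior_var / post_var)


if __name__ == "__main__":
    prior = 100.0           # prior variance of the parameter
    pools, fluxes = 4.0, 4.0  # error variances of two hypothetical datasets
    gain_pools = information_gain_bits(prior, posterior_variance(prior, pools))
    gain_fluxes = information_gain_bits(prior, posterior_variance(prior, fluxes))
    gain_both = information_gain_bits(
        prior, posterior_variance(prior, pools, fluxes)
    )
    print(gain_pools, gain_fluxes, gain_both)
```

With these numbers each dataset alone yields about 2.35 bits, while both together yield about 2.84 bits rather than 4.70, because much of what the second dataset says about the parameter is already known from the first.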


QUESTION: 


1. Why are the information contents of pool 
measurements and flux measurements not 
additive? 
Figure 32.4. Comparison among model runs without data assimilation and with assimilation of pool and/or flux measurements. (A) Posterior distribution of parameter 
values. (B) Information gain. (C) Simulations after assimilating carbon pool measurements. Full names and units of parameters can be seen in “Source_code/TECO_2.3/
annotations in TECO model/parameter_file annotations.jpg”. In (C), black point and error bar (smaller than the point size) indicate the mean and the 
estimated standard deviation of measurements, respectively. Solid line, deep gray band, and light gray band indicate the mean, first to third quartiles, and range of 50 simulations, 
respectively. Simulation uncertainty is shown only for GPP and foliage biomass carbon. RMSE indicates root mean square error. 


EXERCISE 5: Real run of data assimilation 


Advanced readers may run data assimilation
on their own computer and obtain the results
shown in Figures 32.1-32.4. To obtain outputs
corresponding to Exercise 1, select unit 8
and Exercise 5, set the output directory, then
click Set Namelist and edit the namelist file
by adjusting the settings as follows:


do_co2_da = True
use_cflux_ob = 0
use_cpool_ob = 0


Click Run Exercise to run the model. Be aware 
that, depending on the specifications of your 
computer, it usually takes 8 hours or more to 
finish the model run. 

To obtain model output corresponding to 
Exercises 2, 3, and 4, edit the namelist file as 
follows: 


Exercise 2            Exercise 3            Exercise 4
do_co2_da = True      do_co2_da = True      do_co2_da = True
use_cflux_ob = 0      use_cflux_ob = 1      use_cflux_ob = 1
use_cpool_ob = 1      use_cpool_ob = 0      use_cpool_ob = 1


When all model outputs are available, you
can visualize the results using the R script
“Source_code/TECO_2.3/R code for DA Unit 8.R”
(you may need to modify the working directory
and file names).

Besides the above exercises, you can also
run data assimilation after changing initial
parameter values (click Set Initial parameters
in CarboTrain) or selecting parameters for data
assimilation and their ranges (click select DA
pars). You can also assimilate methane, water,
and temperature measurements into the model
(click Set Namelist). Finally, if you want to
assimilate measurements from your own site into
TECO v2.3, you will need to prepare climate
forcing and your measurements in the format of
the files in
Training Course/Source_code/TECO_2.3/input.
If you are familiar with Fortran, you may also
modify the model source code in TECO_2.3.f90,
which can be found in the folder
Training Course/Source_code/TECO_2.3.


SUMMARY


We have seen that different types of measurements
can constrain different model parameters and
model simulations of different carbon processes.
The model run using default parameter values
and ranges gives information about the model.
Assimilating carbon pool measurements can
potentially constrain carbon turnover times and
model predictions of carbon pool sizes.
Assimilating carbon pool measurements may also
constrain model predictions of carbon fluxes if
carbon fluxes are modeled to be proportional to
carbon pool sizes. Assimilating carbon flux
measurements can constrain model parameters
related to carbon fluxes and model predictions of
carbon fluxes but may not be able to constrain
model predictions of carbon pools. Assimilating
both carbon pool and flux measurements can
generally constrain model parameters and
predictions more than assimilating either group
of measurements alone, but the information
contents of carbon pool and flux measurements
may overlap with each other. Assimilating
carbon measurements into a model can reduce not
only model uncertainties but also model biases
in simulating carbon dynamics during forward
runs. Overall, data assimilation is a tool
to extract information from measurements. It can
guide the collection of measurements by
experimentalists and be used by modelers to
constrain model predictions of land carbon
dynamics.
In this rapidly changing world, improving the
capacity to forecast future dynamics of ecological
systems and their services is essential for better
stewardship of the Earth system. This chapter
introduces ecological forecasting, the next frontier
of research in ecology. Using weather forecasting
as an analog, this chapter discusses four elements
of ecological forecasting: the predictability of
the land carbon cycle; observations to constrain
forecasting; data assimilation to integrate data
with models; and a workflow system to automate
ecological forecasting. This chapter also describes
applications of an ecological forecasting system to
a warming and CO₂ experiment in northern Minnesota
and a precipitation mean and variance experiment
in New Mexico.


INTRODUCTION 


As you have seen in Figure 21.1, to realistically
forecast ecosystem responses to environmental change,
we need three elements: (1) a model structure to
represent the real-world processes that control system
functions; (2) parameterization to reflect system
properties; and (3) external forcing variables that an
ecosystem experiences. We have studied the matrix
approach to process-based modeling of the carbon
cycle in units 1-5. This chapter will examine the
predictability of the land carbon cycle, according
to the matrix equation, to understand how well carbon
forecasting can be expected to perform. We have also
studied data assimilation for parameter estimation in
units 6-8. This chapter will explore the availability
of observations to achieve accuracy in ecological
forecasting with different levels of model complexity.
Training in this unit (i.e., this chapter, Chapter 34,
and Practice 9 in Chapter 35) is focused on a workflow
system, the Ecological Platform for Assimilating Data
(EcoPAD) into models, to link real-time forcing and
automate ecological forecasting.


WEATHER FORECASTING 


Before we get into ecological forecasting, let us
learn something from weather forecasting. Probably
everyone is familiar with weather forecasts. First,
please take a moment to answer a few multiple-choice
questions. How frequently do you look at a weather
forecast? A. never; B. once every few days; C. once
a day; or D. a few times a day. You may make your
own choice. Why do you look at a weather forecast?
A. to decide what clothes to wear; B. to decide what
kinds of outdoor activities to do; C. to decide
whether you will do some field research; or D. for
some other reason. What benefits do you think weather
forecasting brings to society? A. it saves lives;
B. it supports emergency management; C. it mitigates
the impact of, and prevents economic losses from,
high-impact weather; D. it facilitates financial
revenue in the energy, agriculture, transport, and
recreation sectors; or E. all of the above.
These questions show that weather forecasting has
become part of our lives, influences our daily
activities, and is relevant to many aspects of our
society.

According to a review paper by Peter Bauer
et al. (2015) published in Nature, weather
forecasting skill has been steadily improving. By
2014 the skill had reached 98% for a three-day
weather forecast and about 60% for a seven-day
weather forecast. I have personal experience of the
accuracy of the weather forecast. Many of you may
also have noticed how accurate the weather forecast
has become.

Numerical weather prediction as a scientific
discipline has been developing for more than one
hundred years. The major milestones of weather
prediction include the recognition in 1901 that the
laws of physics make weather forecasting possible,
the development and use of supercomputing in the
1970s, the use of satellite and other observations
in the 1980s, and the use of data assimilation in
the 1990s. For example, it is relatively well known
that the physical processes that determine weather
dynamics include energy and water fluxes, momentum
dynamics, and land surface conditions, among others.

Numerical weather prediction uses extensive
observations from radar and other sources in data
assimilation to generate weather patterns.
Observations are used to constrain initial values
every few hours. The data assimilation methods
include 3D-Var, 4D-Var, and, more recently, the
ensemble Kalman filter. Data assimilation with
complex weather models is computationally expensive.
The accuracy and resolution of numerical weather
prediction models increase over time as computational
power increases exponentially. In short, success in
weather forecasting depends on understanding
physical laws to develop models, collecting satellite
and other data, using data assimilation to constrain
initial values every few hours, and relying on
supercomputing to carry out the calculations of the
numerical models.
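
The way an observation constrains a forecast state can be illustrated with the simplest member of the Kalman filter family, a scalar update. This Python sketch is a teaching example under strong assumptions (one state variable, Gaussian errors, a direct observation of the state); operational 3D-Var, 4D-Var, and ensemble Kalman filter systems work on very large state vectors and are far more elaborate. The function name and numbers are hypothetical.

```python
def kalman_update(forecast, forecast_var, observation, obs_var):
    """Blend a forecast and an observation, weighting each by its certainty."""
    # The gain is the share of the forecast-observation gap that is corrected.
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


if __name__ == "__main__":
    # Hypothetical temperature forecast of 10 with variance 4, observed as 12
    # with the same error variance.
    print(kalman_update(10.0, 4.0, 12.0, 4.0))
```

When forecast and observation are equally uncertain, the updated value lies halfway between them and its variance is halved; this updated state is what would serve as the initial value for the next forecast cycle.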

Similar to weather forecasting, ecological
forecasting also needs process-based models,
observations, data assimilation, and supercomputing.
The process-based models offer model structure,
whereas observations are assimilated into
the process-based models for parameterization
through data assimilation via supercomputing.


MODELS AND PREDICTABILITY OF THE 
TERRESTRIAL CARBON CYCLE 


Process-based models of the terrestrial carbon cycle
have different levels of complexity, but their
general properties were examined in units 1-5 through
the matrix approach. One of the key properties of
terrestrial carbon dynamics is the convergence toward
equilibrium over time, even if external forcing and
disturbances often push the carbon cycle into
disequilibrium. Using this intrinsic property, Luo
et al. (2015) examined the predictability of the
terrestrial carbon cycle. The rate of approach to
equilibrium, and the equilibrium itself, are
relatively predictable given knowledge about carbon
input rates, loss rates, the initial conditions, and
governing environmental constraints. For individual
processes, Luo et al. (2015) distinguished three
levels of predictability (high, medium, and low) plus
two cases in which the predictability is less known
or unknown (see Table 33.1).
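
The convergence property can be seen in the simplest possible case, a single carbon pool with constant input and first-order loss. The sketch below is a deliberate simplification of the multi-pool matrix equation, and the function name and parameter values are assumptions made for illustration. For a pool $X$ with influx $u$ and residence time $\tau$, $dX/dt = u - X/\tau$ has the solution $X(t) = u\tau + (X_0 - u\tau)e^{-t/\tau}$.

```python
import math


def pool_size(t, x0, influx, residence_time):
    """Size at time t of one pool with constant influx and first-order loss."""
    equilibrium = influx * residence_time
    # The initial deviation from equilibrium decays exponentially.
    return equilibrium + (x0 - equilibrium) * math.exp(-t / residence_time)


if __name__ == "__main__":
    # Two hypothetical starting points converge to the same equilibrium.
    for years in (0, 10, 50, 200):
        low = pool_size(years, 5.0, 2.0, 10.0)
        high = pool_size(years, 40.0, 2.0, 10.0)
        print(years, round(low, 3), round(high, 3))
```

Whatever the initial pool size, the trajectory approaches the product of influx and residence time, which is why the endpoint and the approach rate are predictable once those two quantities are known.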

For example, some external variables exhibit
cyclic changes, typically causing carbon flux
rates, such as photosynthesis and respiration, to vary
with the same period as the forcing (Table 33.1). The
responses of terrestrial carbon to daily and seasonal
cyclic forcing should be highly predictable. However,
the mechanisms underpinning interannual variability
in the terrestrial carbon cycle, as reflected in
eddy-flux measurements and variations in the growth
rate of atmospheric CO₂, are less well known, making
it difficult at present to evaluate its
predictability.

Disturbance events themselves, such as wildfires and
climate extremes, have an inherent random component
(e.g., the chance of a hurricane), making the
predictability of individual events relatively
low. Likewise, the severity of disturbance impacts on
the carbon cycle is not very predictable. The
recovery dynamics following a disturbance, however,
appear to be highly predictable given adequate
knowledge of the carbon influx rates, the residence
times, and the pool sizes following disturbance
(Table 33.1). Moreover, there is evidence that some
ecosystems may recover to an alternative steady state
following disturbance. Our lack of understanding of
why this occurs limits our assessment of its
consequences for carbon cycle predictability.

Most of the direct effects of climate change
on the terrestrial carbon cycle can be predicted
via relatively simple response functions in Earth
system models (ESMs). However, climate change also
causes indirect effects on the terrestrial carbon
cycle, such as changes in plant species composition,
microbial priming, and respiratory acclimation. The
indirect effects are much less well understood,
making it currently unclear just how predictable
they are (Table 33.1). Moreover, climate change may
also induce shifts in disturbance regimes and changes
in ecosystem states, which are less predictable, as
discussed below.


TABLE 33.1


Intrinsic predictability of response patterns of the terrestrial carbon cycle to five classes of external forcing. The
predictability of the carbon cycle measures the degree to which the response pattern is predictable given one class
of external forcing. The predictability is usually judged by the sensitivity (e.g., diverging vs. converging) of system
behavior in response to various classes of perturbation and external forcing. In general, carbon cycle responses
per se are more predictable than external forcing, which causes much higher uncertainty in predicting carbon cycle
responses to climate change


| External forcing: class | External forcing: example | Response: general pattern | Response: component | Intrinsic predictability |
|---|---|---|---|---|
| Cyclic environment | Diurnal, seasonal, and interannual | Cyclic | Diurnal and seasonal | High |
| | | | Interannual | Less known |
| Disturbance event | Fire, land use, insect outbreak, and storms, etc. | Pulse-recovery | Time of events happening | Low |
| | | | Immediate impacts of disturbance events on carbon cycle | Medium |
| | | | Recovery | High |
| | | | Recovery to original or new equilibrium | Less known |
| Climate change | Rising [CO₂]ₐ, climate warming, altered precipitation | Gradual | Direct impacts | High |
| | | | Indirect impacts via induced changes in disturbance regimes and ecosystem states | Less known |
| Shifts in disturbance regimes | Regional, long-term patterns of fire, land use, insect outbreak, and storm, etc. | Disequilibrium | Joint probability to describe disturbance regimes and their shifts | Unknown |
| | | | Impacts of shifted disturbance regimes on mean carbon storage | High |
| Ecosystem state change | Forest to cropland, grassland to cropland, reforestation, etc. | Abrupt changes | When and where ecosystem states change | Less known |
| | | | Carbon cycle change with ecosystem states | High |


From Luo et al. (2015)

Disturbance regimes can be quantified by joint
probability distributions of disturbance frequency
and severity, which, in turn, can be used
to generate a probability distribution of ecosystem
carbon storage. The mean of the probability
distribution determines the realizable carbon storage
capacity under a given regime, reflecting the mean
carbon storage capacity over a sufficiently long time
period or over a sufficiently large area. This mean
carbon storage capacity could thus be predictable.
However, we do not have enough knowledge to
predict when the disturbance regime changes under
direct or indirect anthropogenic forcing.
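
How a disturbance regime translates into a mean carbon storage can be illustrated with a small Monte Carlo experiment. The Python sketch below is a toy model built on assumptions that are not from the chapter: a single carbon pool updated once a year, a fixed annual disturbance probability, and a severity (fraction of carbon removed) drawn uniformly from a range. The function name and parameter values are hypothetical.

```python
import random


def mean_carbon_storage(influx, residence_time, frequency, severity_range,
                        years=20000, spinup=2000, seed=0):
    """Long-term mean size of one carbon pool under random disturbances."""
    rng = random.Random(seed)
    x = influx * residence_time  # start at the undisturbed equilibrium
    total = 0.0
    for year in range(years):
        x += influx - x / residence_time  # annual carbon balance
        if rng.random() < frequency:      # a disturbance occurs this year
            x *= 1.0 - rng.uniform(*severity_range)
        if year >= spinup:
            total += x
    return total / (years - spinup)


if __name__ == "__main__":
    for freq in (0.0, 0.02, 0.1):
        print(freq, round(mean_carbon_storage(2.0, 10.0, freq, (0.5, 0.9)), 2))
```

Individual disturbance years are random, but the long-term mean for a given frequency and severity range is stable across long simulations, and it declines as disturbances become more frequent.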

When ecosystem states change, rates of carbon 
cycling among the plant, litter, and soil carbon 
pools also change. Given the change in vegeta- 
tion structures and corresponding parameters, a 
consequent change in the carbon cycle is quantifi- 
able. However, while vegetation state changes have 
been studied, their relationships with those carbon 
cycle parameters remain poorly understood. 

Overall, many processes of the terrestrial car- 
bon cycle are intrinsically predictable. For these 
processes, forecasting is expected to be highly 
achievable. However, the indirect effects of climate 
change on terrestrial carbon cycling become less 
predictable, especially those which, via changes in 
species composition and disturbance regimes, lead 
to ecosystem state changes. In addition, individ- 
ual disturbance events usually occur stochastically 
even within a stationary disturbance regime. For 
such processes, forecasting is expected to be more 
uncertain. 


DATA AVAILABILITY TO CONSTRAIN FORECAST 
VIA DATA ASSIMILATION 


There are plenty of data available to support
forecasting the carbon cycle. For example, eddy-flux
networks provide half-hourly data streams at
hundreds of sites to support near-term forecasting
of carbon cycle dynamics over daily, seasonal, and
interannual time scales.

Data are also available on long-term processes,
such as disturbance events and subsequent recovery.
Forecasting disturbance events themselves is
not easy at this stage. However, data are available
to test forecasting of recovery processes. As
disturbance regimes and their impacts on the carbon
cycle take place over quite long time scales, it is
not clear how real-time or near-term forecasting
would be useful to research or management. But it is
feasible to use available data to test the capability
of forecasting. Plenty of data are also available on
ecosystem state changes for studying ecological
forecasting.

There are many global change experiments
ongoing right now. They offer great opportunities
to test forecasting capability at some of the
experimental sites. We have been forecasting responses
of the carbon cycle to experimental treatments at
five levels of temperature and two levels of
atmospheric CO₂ concentration at the SPRUCE site since
2016 (see Chapter 25). We are also setting up the
forecasting system for a drought experiment at the
Sevilleta Long Term Ecological Research (LTER)
site in New Mexico.

Data from observational networks and experiments
are integrated into models via data assimilation
before ecological forecasts are made. Units
6-8 describe basic concepts, procedures, and
application cases of data assimilation.


WORKFLOW SYSTEM TO FACILITATE 
ECOLOGICAL FORECASTING 


Ecological forecasting is usually carried out in an
automatic fashion. We have developed a workflow
system for near-real-time forecasting. The system
is the Ecological Platform for Assimilating Data
(EcoPAD) into models (Huang et al. 2019) (also
see Chapter 34). EcoPAD is a software system that
links sensor networks to ecological forecasting. It
integrates eco-informatics, web technology,
ecological models, data assimilation techniques, and
visualization. EcoPAD was designed to promote
interactions among modeling, experimentation,
and observations to gain the best science.

Data and models are integrated through a data
assimilation system before the trained models are
used for forecasting, optimization of measurement
plans, and uncertainty analysis. The data to be
integrated can come from real-time sensor networks
or from spreadsheets with records of hand
measurements (Figure 33.1). EcoPAD offers three
options in response to a request from users:
simulation, data assimilation, and forecasting.
Once a user makes a request, EcoPAD executes
the task automatically to generate results. The
generated results can also be visualized in real time.
Thus, EcoPAD is an interactive software system for
researchers to automatically execute model
simulation, data assimilation, and ecological forecasting
in real or near time.

Figure 33.1. EcoPAD to streamline data ingestion from sensors and servers, model simulation, data assimilation, forecasting, and visualization. By timely updating of the parameter values, EcoPAD enables close interactions between experimenters and the modelers.

We have applied EcoPAD to the SPRUCE
experimental site located in northern Minnesota (see
Chapter 25). SPRUCE is a whole-ecosystem warming
and CO2 enrichment project. It has five levels
of temperature treatment and two levels of
CO2 concentration. The experiment follows a
gradient design with five chambers for the five
temperature levels at ambient CO2 concentration
and five chambers at elevated CO2 concentration.
The project is very well equipped with many
real-time sensors and involves more
than 100 scientists who perform many kinds of
measurements. The real-time sensors send data to
data servers. Data servers also store the data from
hand measurements. EcoPAD automatically ingests
data from data servers through a web app server
for data assimilation and forecasting. The forecast
results are automatically sent to two sites, one at
the SPRUCE site and one at our lab website, for
visualization. EcoPAD has produced ecological
forecasts automatically at midnight on Saturday
every week since June 2016.

The forecasting variables include snow cover, 
soil thermal dynamics, and frozen depth (Huang 
et al. 2017); many carbon cycle variables, such as 
gross primary production, net primary produc- 
tion, net ecosystem production, and ecosystem 
respiration (Jiang et al. 2018); and methane flux 
and pathways (Ma et al. 2017).

We are applying EcoPAD for data assimilation
and ecological forecasting at the Sevilleta Long-Term
Ecological Research (LTER) site in New Mexico.
The Sevilleta LTER site currently hosts several
experiments, mainly related to precipitation and
nitrogen fertilization. In addition, there are several
long-term eddy-flux towers that have been measuring
carbon, water, and energy fluxes for years.
We are developing the capability to do real-time or
near-time data assimilation and ecological
forecasting at those experimental and eddy-flux sites.

In fact, EcoPAD can be used as a smart
experiment-modeling system. First, the system
can predict how ecosystems may respond to
treatments once you have selected a site and decided on
your experimental plan. When we were writing a
proposal to continue the LTER study at Sevilleta, 
New Mexico, we used the TECO model to do 
pre-experiment analysis on possible ecosystem 
responses to increasing variability in precipitation. 
The modeling results were included in the pro- 
posal. Once you get funding to do the experiment, 
you can use EcoPAD to assimilate the data you are 
collecting to constrain model forecasts. The model 
forecasts what ecosystem responses may likely be 
for the remaining period of your experiment. The 
forecast ecosystem responses can be used as refer- 
ences for you to design your measurement plan. At 
the SPRUCE project, Shuang Ma's forecast results 
stimulated discussion about how much methane 
may be released through bubbling. Discussion 
on this issue lasted for a few weeks. The exten- 
sive discussion led to improvement of Shuang's 
methane model, new ideas to design additional 
measurements of methane concentrations along 
the soil profiles, and more collaborations between 
experimental and modeling teams. Moreover, the
uncertainty analysis with EcoPAD can tell us which
datasets are most important and which need more
measurement in order to understand the system
dynamics. During the course of our study, we can
use EcoPAD to periodically update the forecast by 
repeating steps from data assimilation to forecast- 
ing to improvement of measurement and model. 
We are using EcoPAD to do weekly forecasts at the 
SPRUCE project. During this process, we improve 
the models, the experiments, and the data assimi- 
lation system. 


In summary, carbon cycle forecasting is a new 
frontier of research in ecology. As many processes 
are highly predictable, forecasting carbon cycle 
dynamics is expected to be highly achievable. 
Plenty of data are available to constrain forecast- 
ing with data assimilation. Workflow systems to 
automate data-model integration and ecological 
forecasting are becoming available. The challenge 
remains to identify societally relevant issues that 
carbon cycle forecasting can address.


QUIZZES


1. What are the elements leading to the success of
weather forecasting?


2. We need forcing variables to be consistent
between models and field sites for realistic
forecasts because

a. the model needs forcing variables to drive
simulations

b. it is easy to get forcing variables from field
sites

c. environmental variables that drive the model
prediction have to represent what ecosystems
experience

d. we can measure environmental variables at
field sites


3. What is a workflow system?


4. Pre-experiment modeling analysis is useful
because

a. it gives some general ideas on what ecosystem
responses may look like

b. it will give us the precise prediction of
ecosystem responses

c. we do not have to carry out the experiment
anymore

d. it helps us adjust the measurement plan


Tremendous scientific endeavors in ecology have
been driven by the goal of forecasting future
ecological dynamics. This chapter introduces the
web-based Ecological Platform for Assimilating Data
into models (EcoPAD), which facilitates ecological
forecasting. The objectives of this lecture are to
understand why we need a platform like EcoPAD, the
structure of the platform, and how to use EcoPAD
to facilitate ecological forecasting.


WHY DO WE NEED ECOPAD? 


Ecological research is driven by the quest to
understand the dynamics of biota and their
interactions with the environment. To understand
ecological patterns, processes, and functions, we
conduct field or manipulative experiments. From
these experiments, we combine a wide range of
approaches to obtain relevant observational data
(e.g., through real-time sensors, laboratory
measurements, remote sensing, and video-based
observations). For example, the National Ecological
Observatory Network (NEON) of the United States
monitors ecosystems across the United States by
collecting hundreds of data products including
organismal counts and measurements, water and
soil quality, energy fluxes, and remotely sensed
vegetation indices. Many similar observational
networks now provide us with a wide range of
ecologically relevant datasets for different regions. To
name a few, FLUXNET tracks the exchanges of
carbon dioxide, water vapor, and energy between the
biosphere and atmosphere for a network of flux
tower sites around the world, while DroughtNet
focuses on the responses of terrestrial ecosystems
to drought. Knowledge obtained from data
and experiments enables us to make inferences
about ecological dynamics under novel situations.
The inference could be based on complex
mathematical models built upon data as well as simple
relationships derived from data. We call inference
under novel conditions prediction. Forecasting is a
type of prediction in which we make predictions
about the future.

Ecological forecasting is not only valuable for 
contributing to scientific advances but is also prac- 
tically valuable in guiding resource management 
and decision-making towards a sustainable future. 
The practical need for ecological forecasting is 
particularly urgent for our current rapidly chang- 
ing world, which is experiencing unprecedented 
food insecurity, natural resource depletion, biodi- 
versity loss, climate changes, and pollution of air, 
waters, and soils. This practical need has brought 
a growing number of forecasting-oriented studies,
for example, on fisheries, crop yield, species
dynamics, algal blooms, phenology, pollinator
performance, and biodiversity. Ecological forecasting
is valuable in almost every subdiscipline of ecol- 
ogy. Recent progress especially in available data, 
improved process understanding, data assimilation 
techniques, and advanced cyber-infrastructure, is 
converging to transform ecological research into 
advanced quantitative forecasting. In practice, 
however, ecological forecasting remains largely 
aspirational, with the number of forecasting stud- 
ies lagging behind demand. One bottleneck is the 
lack of infrastructure to enable timely integration 
of data into models. EcoPAD is designed as a solu- 
tion to widen this bottleneck. It provides a fully 
interactive infrastructure to facilitate ecological 
forecasting, especially near-time ecological
forecasting based on iterative data-model integration.

EcoPAD (https://ecolab.nau.edu/ecopad_portal/,
last access: November 2020) serves to link
ecological experiments and data with models and 
provides easily accessible and reproducible data- 
model integration with interactive web-based 
simulation, data assimilation, and forecasting. The 
system is designed to streamline web request- 
response, data management, modeling, prediction, 
and visualization to boost the overall throughput of 
observational data, promote data-model integra- 
tion, facilitate communication between modelers 
and experimenters, inform ecological forecasting, 
and improve scientific understanding of ecological 
processes. EcoPAD facilitates estimation of model 
parameter values, evaluation of model structure, 
assessment of information content of datasets, 
and understanding of uncertainties revealed by 
model-data fusion exercises. Additionally, EcoPAD 
automates data management, model simulation, 
data assimilation, ecological forecasting, and result 
visualization. It provides an open, convenient, 
transparent, flexible, scalable, traceable, and read- 
ily portable platform to systematically conduct 
data-model integration towards better ecological 
forecasting. The automated near-time ecological 
forecasting through EcoPAD updates periodically 
in a manner similar to weather forecasting. This 
design of EcoPAD enables it to function as a smart 
interactive model-experiment (ModEx) system 
(Figure 34.1). ModEx forms a feedback loop in 
which field experiments guide modeling and mod- 
eling influences experimental focus. Information is 
constantly fed back between modelers and experi- 
mentalists, and simultaneous efforts from both 
parties advance and shape understanding towards
better forecasts. ModEx can: (1) predict what an
ecosystem’s response might be to treatments once 
the experimenter has selected a site and decided 
the experimental plan; (2) assimilate the data col- 
lected during the experiment to constrain model 
predictions; (3) project the expected ecosystem 
responses in the rest of the experiment; (4) tell 
experimenters which priority datasets to collect in 
order to better understand the system; (5) peri- 
odically update the projections; and (6) improve 
the models, the data assimilation system, and field 
experiments during the process. 

In addition to forecasting and facilitating 
interaction between modeling and experimental 
communities, EcoPAD is desirable because of the 
potential service it can bring to society. Forecasting 
with carefully quantified uncertainty is helpful in 
providing support for natural resource managers 
and policy makers. It is always difficult to bring 
complex mathematical ecosystem models to end 
users who do not have training in modeling, which 
creates a gap between current scientific advances 
and public awareness. The web-based interface of 
EcoPAD makes modeling as easy as possible with- 
out losing the connection to the mathematics, 
knowledge and data behind the models. In this 
way, infrastructure like EcoPAD has the potential to 
transform environmental education and encourage 
citizen science in ecology and climate change. 


GENERAL STRUCTURE OF ECOPAD 


The essential components brought together by 
EcoPAD include experiments and data, ecological 
models, data assimilation techniques, and the sci- 
entific workflow (Figure 34.1). 

Figure 34.1. Schema of approaches to forecast future ecological responses under (a) current practice; and (b) the Ecological Platform for Assimilating Data (EcoPAD). Current practice makes use of observations to develop and/or calibrate models to make predictions. EcoPAD goes further by linking models to data through a formalized, iterative cycle using a fully interactive platform. EcoPAD consists of four major components: experiment and data, model, data assimilation, and the scientific workflow (green arrows or lines). Data and model are iteratively integrated through its data assimilation systems to improve forecasting. Its near-real-time forecasting results are shared among research groups through a web interface to guide new data collections. The scientific workflow enables web-based data transfer from sensors, model simulation, data assimilation, forecasting, result analysis, visualization, and reporting, encouraging user-model interactions, especially for experimentalists and end users having a limited background in modeling. Adapted from Huang et al. (2019).

Data are the foundation of ecological modeling
and forecasting. We have entered the “big
data” age, characterized by the ready availability
of different, often extensive datasets across various
temporal-spatial scales. These datasets might
have high temporal resolution, such as time series
from real-time ecological sensors, or extensive
spatial coverage from remote sensing sources and
data stored in geographic information systems.
Data may contain information related to
environmental forcing (e.g., precipitation, temperature,
incoming radiation), site characteristics (e.g., soil
texture and species composition), or the
biogeochemistry of soils and waters. EcoPAD offers systematic
data management to digest diverse data streams.
Datasets in EcoPAD are derived from research
projects in comma-separated value (csv) files or
other loosely structured data formats. These
datasets are first described and stored with appropriate
metadata via either manual operation or scheduled
automation from sensors. Data are generally
separated into two groups: forcing variables that
drive the model, and observations that are
used for data assimilation. Scheduled sensor data
are appended to existing data files at a prescribed
frequency. Attention is given to how each
dataset varies over space and time. Once this
spatio-temporal variability is understood, it is
documented in metadata records that can be queried
through EcoPAD’s scientific workflow.
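The separation of an incoming file into forcing variables and observations, together with a small metadata record for each stream, can be sketched as follows. This is a minimal illustration rather than EcoPAD's actual ingestion code: the column names, the `split_streams` and `describe` helpers, and the sample rows are all assumptions made for the example.

```python
import csv
import io

# Hypothetical column names; a real site file defines its own.
FORCING_COLUMNS = ("air_temp_c", "precip_mm")
OBSERVATION_COLUMNS = ("gpp_gc_m2",)

SAMPLE_CSV = """timestamp,air_temp_c,precip_mm,gpp_gc_m2
2016-06-01T00:00,14.2,0.0,0.41
2016-06-01T00:30,13.9,0.2,
2016-06-01T01:00,13.5,0.0,0.38
"""

def split_streams(csv_text):
    """Separate rows into forcing records and observation records."""
    forcing, observations = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        stamp = row["timestamp"]
        forcing.append(
            {"timestamp": stamp, **{k: float(row[k]) for k in FORCING_COLUMNS}}
        )
        # Observations may be missing at some time steps (empty cells).
        obs = {k: float(row[k]) for k in OBSERVATION_COLUMNS if row[k] != ""}
        if obs:
            observations.append({"timestamp": stamp, **obs})
    return forcing, observations

def describe(name, role, records):
    """Build a minimal metadata record summarizing temporal coverage."""
    stamps = [r["timestamp"] for r in records]
    return {
        "name": name,
        "role": role,  # "forcing" or "observation"
        "n_records": len(records),
        "start": min(stamps) if stamps else None,
        "end": max(stamps) if stamps else None,
    }

forcing, observations = split_streams(SAMPLE_CSV)
forcing_meta = describe("demo_forcing", "forcing", forcing)
observation_meta = describe("demo_gpp", "observation", observations)
print(forcing_meta)
print(observation_meta)
```

Keeping the two groups apart at ingestion means the model driver only ever reads forcing records, while the data assimilation step only ever compares simulations against observation records.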

The workflow and data assimilation system
of EcoPAD are relatively independent of any
specific ecological model. To illustrate the integration
of models with EcoPAD, we take the Terrestrial
ECOsystem (TECO) model as a general example.
Linkages among the workflow, data assimilation
system, and ecological model are based on
messaging. For example, the data assimilation system
generates parameters that are passed to ecological
models. The state variables simulated by the
ecological models are passed back to the data assimilation
system. Models may have different formulations.
As long as these models take in the same
parameters and simulate the same state variables, they are
functionally identical from the point of view of the
data assimilation system. TECO simulates
ecosystem carbon, nitrogen, water, and energy dynamics.
The original TECO model has four major
submodules (canopy, soil water, vegetation dynamics, and
soil carbon and nitrogen) (Weng and Luo, 2008)
and has been further extended to incorporate methane
biogeochemistry and snow dynamics (Huang
et al., 2017; Ma et al., 2017).
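The idea that two models are interchangeable for the data assimilation system when they accept the same parameters and return the same state variables can be sketched with a shared interface. The classes below are toy one-pool models invented for illustration; they are not TECO, and the `EcosystemModel` protocol and `misfit` function are hypothetical names.

```python
from typing import Dict, List, Protocol

class EcosystemModel(Protocol):
    """Anything with this run() signature can be plugged in."""
    def run(self, params: Dict[str, float],
            forcing: List[float]) -> Dict[str, List[float]]: ...

class ExplicitPool:
    """Toy one-pool carbon model, explicit time step."""
    def run(self, params, forcing):
        pool, series = params["c0"], []
        for carbon_input in forcing:
            pool = pool + carbon_input - params["k"] * pool
            series.append(pool)
        return {"soil_c": series}

class ImplicitPool:
    """Same parameters and outputs, different formulation (implicit step)."""
    def run(self, params, forcing):
        pool, series = params["c0"], []
        for carbon_input in forcing:
            pool = (pool + carbon_input) / (1.0 + params["k"])
            series.append(pool)
        return {"soil_c": series}

def misfit(model: EcosystemModel, params, forcing, observed):
    """Sum of squared differences; the caller never inspects model internals."""
    simulated = model.run(params, forcing)["soil_c"]
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))

params = {"c0": 100.0, "k": 0.05}
forcing = [2.0] * 12
observed = ExplicitPool().run(params, forcing)["soil_c"]

misfit_explicit = misfit(ExplicitPool(), params, forcing, observed)
misfit_implicit = misfit(ImplicitPool(), params, forcing, observed)
print(misfit_explicit, misfit_implicit)
```

Because `misfit` only touches the parameters passed in and the state variables passed back, either model can be swapped in without changing the assimilation code.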

Data assimilation (Chapter 21) provides a
framework to combine models with data to
estimate model parameters, test alternative ecological
hypotheses through different model structures,
assess the information content of datasets, quantify
uncertainties, derive emergent ecological
relationships, identify model errors, and improve ecological
predictions. Under the Bayesian paradigm, data
assimilation techniques treat the model structure
and the initial and parameter values as priors that
represent our current understanding of the system
(Chapter 22). As new information from
observations or data becomes available, model parameters
and state variables can be updated accordingly. The
posterior distributions of estimated parameters or
state variables are imprinted with information from
both the model and the observations (or data), as the
chosen parameters are constrained to reduce mismatches
between observations and model simulations.
Future predictions benefit from such constrained
posterior distributions through forward
modeling. As a result, the probability density functions of
predicted future states following data assimilation
normally have narrower spreads than those without
data assimilation. EcoPAD can accommodate
different data assimilation techniques since the scientific
workflow of EcoPAD is independent of the specific
data assimilation algorithm. One example of a data
assimilation method is the Markov chain Monte
Carlo (MCMC) method introduced in Chapter 22.
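How a prior is narrowed into a posterior can be illustrated with a minimal Metropolis sampler (one simple MCMC variant). The sketch assumes a toy exponential-decay model with a single decay rate `k`, a uniform prior, Gaussian observation error, and synthetic observations generated inside the example; none of these choices come from EcoPAD or TECO.

```python
import math
import random

def model(k, times, c0=100.0):
    """Toy model: carbon pool decaying at rate k."""
    return [c0 * math.exp(-k * t) for t in times]

def log_likelihood(k, times, obs, sigma):
    """Gaussian log-likelihood (up to a constant) of the model-data mismatch."""
    sim = model(k, times)
    return -0.5 * sum(((o - s) / sigma) ** 2 for o, s in zip(obs, sim))

def metropolis(times, obs, sigma, k_min, k_max, n_iter=5000, step=0.02, seed=1):
    """Sample k from its posterior under a uniform prior on [k_min, k_max]."""
    rng = random.Random(seed)
    k = rng.uniform(k_min, k_max)
    ll = log_likelihood(k, times, obs, sigma)
    chain = []
    for _ in range(n_iter):
        k_new = k + rng.gauss(0.0, step)
        if k_min <= k_new <= k_max:  # outside the prior range: always reject
            ll_new = log_likelihood(k_new, times, obs, sigma)
            # Accept with probability min(1, likelihood ratio).
            if rng.random() < math.exp(min(0.0, ll_new - ll)):
                k, ll = k_new, ll_new
        chain.append(k)
    return chain

# Synthetic observations (assumed "truth": k = 0.3, error sd = 2.0).
data_rng = random.Random(0)
times = list(range(10))
true_k, sigma = 0.3, 2.0
obs = [c + data_rng.gauss(0.0, sigma) for c in model(true_k, times)]

chain = metropolis(times, obs, sigma, k_min=0.0, k_max=1.0)
posterior = chain[1000:]  # discard burn-in
post_mean = sum(posterior) / len(posterior)
post_sd = (sum((k - post_mean) ** 2 for k in posterior) / len(posterior)) ** 0.5
prior_sd = (1.0 - 0.0) / math.sqrt(12.0)  # sd of the uniform prior
print(f"prior sd = {prior_sd:.3f}, posterior mean = {post_mean:.3f}, "
      f"posterior sd = {post_sd:.4f}")
```

Running forward simulations with values drawn from `posterior` instead of the prior range is what narrows the spread of the predicted future states.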

The scientific workflow of EcoPAD wraps
around one or more user-specified ecological
models and data assimilation algorithms and acts
to move datasets in and out of structured and
cataloged data collections (the metadata catalog), while
leaving the logic of the ecological models and data
assimilation algorithms untouched (Figure 34.2).
When a user makes a request through the web
browser or command line utilities, the scientific
workflow takes charge of triggering and executing
the corresponding tasks, such as pulling data from a
remote server, running a particular ecological model,
automating forecasting, or making the results
easily accessible and understandable to users through
web-based graphic displays (Figure 34.2). The
workflow system is portable across operating
systems and programming languages and is built to be
scalable to meet the demands of the model and the
end-user community. The essential components
of the scientific workflow of EcoPAD include the
metadata catalog, the web application programming
interface (API), the asynchronous task or job queue
(Celery), and the container-based virtualization
platform (Docker) (Figure 34.2). The workflow
system of EcoPAD also provides structured result
access and visualization. Scientific workflow is a
relatively new concept in the ecology literature
but is essential to realize real- or near-real-time
forecasting. Thus, we describe it in detail below.
Readers who are not interested in technical details
may skip the following paragraphs and jump to the
section “Applications of EcoPAD: The Example of
SPRUCE”.

Figure 34.2. The scientific workflow of EcoPAD. The workflow wraps ecological models and data assimilation algorithms with the Docker containerization platform. Users trigger different tasks through the representational state transfer (RESTful) application programming interface (API). Tasks are managed through the asynchronous task queue, Celery. Tasks can be executed concurrently on one or more worker servers across different scalable IT infrastructures. MongoDB is database software that takes charge of data management in EcoPAD, and RabbitMQ is a message broker. Adapted from Huang et al. (2019).

Datasets can be placed and queried in EcoPAD 
via a common metadata catalog, which allows for 
effective management of diverse data streams. The 
EcoPAD metadata scheme includes a description of 
the data product, access pattern, and time stamp 
of last metadata update. MongoDB (https://www.mongodb.com/,
last access: November 2020), a
NoSQL database technology, is employed to manage 
heterogeneous datasets to make documentation, 
query and storage fast and convenient. Through 
MongoDB, measured datasets can be easily fed into 
ecological models for various purposes such as to 
initialize the model, calibrate model parameters, 
evaluate model structure, and drive model fore- 
casts. For datasets from real-time ecological sensors 
that are constantly updating, EcoPAD can be set to 
automatically fetch new data streams with adjust- 
able frequency according to research needs. 
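A metadata record of the kind described above (data product description, access pattern, and time stamp of the last metadata update) can be sketched as a plain document. The example keeps the catalog in memory so that it runs without a database; in a document store such as MongoDB each dictionary would be one document in a collection. The field names, the `make_record` and `find` helpers, and the example paths and hosts are assumptions, not EcoPAD's actual schema.

```python
from datetime import datetime, timezone

def make_record(name, description, access, role):
    """Build one catalog entry; field names are illustrative only."""
    return {
        "name": name,
        "description": description,
        "access": access,  # how and where the data are retrieved
        "role": role,      # "forcing" or "observation"
        "metadata_updated": datetime.now(timezone.utc).isoformat(),
    }

catalog = [
    make_record(
        "site_weather",
        "Half-hourly air temperature and precipitation",
        {"protocol": "ftp", "path": "/placeholder/weather.csv", "refresh": "weekly"},
        "forcing",
    ),
    make_record(
        "gpp_flux",
        "Gross primary production derived from chamber measurements",
        {"protocol": "file", "path": "/placeholder/gpp.csv", "refresh": "manual"},
        "observation",
    ),
]

def find(records, **criteria):
    """Return the records whose top-level fields match all criteria."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

forcing_records = find(catalog, role="forcing")
print([r["name"] for r in forcing_records])
```

Storing the access pattern alongside the description is what lets a scheduled job decide which streams to fetch again and how often.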

The “gateway” of EcoPAD is the Representational 
State Transfer (RESTful) application programming 
interface (API). It can deliver data to a wide variety 
of applications and enables a wide array of user 
interfaces and data dissemination activities. Once a 
user makes a request, such as through clicking on 
relevant buttons from a web browser, the request 
is passed through the RESTful API to trigger
specific tasks. Thus, the API bridges communication
between the client (e.g., a web browser or
command line terminal) and the server (Figure 34.2).
The API exploits the HyperText Transfer Protocol
(HTTP) such that data can be retrieved from and
ingested into EcoPAD through the use of simple
HTTP headers and verbs (e.g., GET, PUT, POST,
etc.). Since HTTP is also understood by web
servers and clients, a user can incorporate summary
data from EcoPAD into a website with a single line
of HTML code. Users are able to access data directly
through programming environments like R, 
Python, and MATLAB. Simplicity, ease of use, and 
interoperability are among the main advantages of 
the API, which enables web-based modeling. 
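The request-response pattern can be sketched from a programming environment such as Python. The example only builds the HTTP requests and does not send them, because the host, endpoint paths, and payload fields shown here are placeholders and not EcoPAD's documented API.

```python
import json
import urllib.request

# Placeholder host and paths: assumptions for illustration only.
BASE_URL = "https://ecopad.example.org/api"

def build_task_request(task_name, params):
    """POST: ask the server to queue a task with user-specified parameters."""
    body = json.dumps({"task": task_name, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/tasks/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_result_request(task_id):
    """GET: retrieve the stored result of a previously submitted task."""
    return urllib.request.Request(url=f"{BASE_URL}/tasks/{task_id}/", method="GET")

submit = build_task_request("simulation", {"site": "demo", "years": 10})
fetch = build_result_request("example-task-id")

print(submit.get_method(), submit.full_url)
print(fetch.get_method(), fetch.full_url)
# Against a real service, urllib.request.urlopen(submit) would send the request.
```

The same two verbs cover the whole interaction: a POST carries the user's choices to the server, and a GET brings results back to whichever client asked for them.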

The task queue is a mechanism used to
distribute work across work units such as threads or
machines. EcoPAD uses Celery
(https://github.com/celery/celery, last access: November 2020)
as an asynchronous task or job queue that runs in
the background (Figure 34.2). Celery
communicates through messages, and EcoPAD takes
advantage of RabbitMQ (https://www.rabbitmq.com/,
last access: November 2020) to manage
messaging. After the user submits a command, the request
or message is passed to Celery via the RESTful API.
These messages may trigger different tasks, which
include but are not limited to pulling data from
a remote server where original measurements are
located, accessing data through the metadata
catalog, running model simulations with user-specified
parameters, conducting data assimilation that
recursively updates model parameters, forecasting
future ecosystem status, and postprocessing model
results for visualization. The message broker (RabbitMQ)
receives task messages and hands out tasks to
available Celery “workers” that perform the actual
tasks (Figure 34.2). Celery workers are in charge
of receiving messages from the broker, executing
tasks, and returning task results. A worker can be
a local or remote computation resource (e.g., the
cloud) that has connectivity to the metadata
catalog. Workers can be distributed across different
information technology infrastructures, which makes
the EcoPAD workflow expandable to accommodate
more computational resources. Each worker
can perform different tasks depending on the tools
installed on it. One task can also be
distributed to different workers. In this way, the
EcoPAD workflow enables the parallelization and
distributed computation of actual modeling tasks
across various IT infrastructures and is flexible in
adding computational resources
by connecting additional workers.
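The broker-and-worker pattern described above can be sketched with Python's standard library alone. This is a stand-in for the pattern, not Celery or RabbitMQ: a `queue.Queue` plays the role of the broker, threads play the role of workers, and the task names and the `run_tasks` helper are hypothetical.

```python
import queue
import threading
import uuid

# Hypothetical task registry; real tasks would run a model or fetch data.
TASKS = {
    "simulation": lambda p: f"simulated {p['years']} years",
    "assimilation": lambda p: f"assimilated {p['n_obs']} observations",
    "forecast": lambda p: f"forecast {p['weeks']} weeks ahead",
}

def worker(task_queue, results):
    """Take messages from the queue until a None sentinel arrives."""
    while True:
        message = task_queue.get()
        if message is None:
            break
        task_id, name, params = message
        results[task_id] = TASKS[name](params)

def run_tasks(messages, n_workers=2):
    """Queue all messages, let the workers drain them, return results by ID."""
    task_queue, results = queue.Queue(), {}
    threads = [
        threading.Thread(target=worker, args=(task_queue, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    task_ids = []
    for name, params in messages:
        task_id = str(uuid.uuid4())  # unique ID used later to look up results
        task_ids.append(task_id)
        task_queue.put((task_id, name, params))
    for _ in threads:
        task_queue.put(None)  # one stop signal per worker
    for thread in threads:
        thread.join()
    return task_ids, results

task_ids, results = run_tasks([
    ("simulation", {"years": 10}),
    ("assimilation", {"n_obs": 250}),
    ("forecast", {"weeks": 1}),
])
for task_id in task_ids:
    print(task_id, "->", results[task_id])
```

Adding capacity in this sketch means raising `n_workers`; in a distributed queue the analogous step is connecting another worker machine to the same broker.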

Another key feature that makes EcoPAD easily
portable and scalable among different operating
systems is the use of a container-based
virtualization platform, Docker (https://www.docker.com/,
last access: January 2019).
Docker can run many applications that rely on
different libraries and environments on a single
kernel with its lightweight containerization. Tasks
that execute TECO in different ways are wrapped
inside different Docker containers that can “talk”
with each other. Each Docker container embeds
the ecosystem model into a complete file
system that contains everything needed to run an
ecosystem model: the source code, model input,
run time, system tools, and libraries. Docker
containers are both hardware-independent and
platform-independent, and they are not confined
to a particular language, framework, or packaging
system. Docker containers can be run from a
laptop, workstation, virtual machine, or any cloud
compute instance. This is done to support the
wide variety of ecological models running
in various languages (e.g., MATLAB, Python,
Fortran, C, and C++) and environments. In
addition to the ecosystem model, the software
applied in the workflow, such as Celery, RabbitMQ,
and MongoDB, is also encapsulated in lightweight
and portable Docker containers. Therefore, EcoPAD is
readily portable to different environments.

EcoPAD enables structured result storage, access, 
and visualization to track and analyze data-model 
fusion practice. Upon the completion of the model 
task, the model wrapper code calls a postprocess- 
ing callback function. This callback function allows 
model-specific data requirements to be added to 
the model result repository. Each task is associ- 
ated with a unique task ID and model results are 
stored within the local repository that can be
queried by the unique task ID. The storage and query
of model results are realized via MongoDB
and the RESTful API (Figure 34.2). Researchers are
able to review and download model results and 
parameters submitted for each model run through 
a web-accessible URL (link). The EcoPAD web 
page also displays a list of historical tasks (with 
URL) performed by each user. All current and his- 
torical model inputs and outputs are available to 
download, including the aggregated results pro- 
duced for graphical web applications. In addition, 
EcoPAD also provides a task report that contains an 
all-inclusive recap of submitted parameters, task 
status, and model outputs with links to all data 
and graphical results for each task. Such structured 
result storage and access make sharing, tracking, 
and referring to modeling studies instantaneous 
and clear. 


APPLICATIONS OF ECOPAD: THE EXAMPLE OF 
SPRUCE 


The SPRUCE experiments and datasets were
introduced in Chapter 25. Here, we demonstrate the use
of the EcoPAD infrastructure as a way to assimilate
multiple streams of data from the SPRUCE
experiment into the TECO model using the MCMC
algorithm and to forecast ecosystem dynamics both in
near time and for the next ten years. A similar
example was presented in Chapter 26. The forecasting
system for SPRUCE is available at
https://ecolab.nau.edu/ecopad_portal/ (last access: November
2020). From the web portal, users can check our
current near- and long-term forecasting results,
conduct model simulation, data assimilation, and
forecasting runs, and analyze and visualize model
results. We set up the system to automatically pull
new data streams every Sunday from the SPRUCE
FTP site that holds observational data and to update
the forecasting results based on the new data streams.
Updated forecasting results for the following week
are customized for the different manipulative
treatments of the SPRUCE experiments and displayed in
the EcoPAD-SPRUCE portal. At the same time, these
results are sent back to the SPRUCE community and
displayed together with near-term observations for
experimentalists to study.

In the SPRUCE project, we take advantage of 
this platform to stimulate interactive communica- 
tion between modelers and experimentalists, study 
the acclimation of ecosystem carbon cycling to 
experimental manipulations, partition uncertainty 
sources in forecasting, improve the biophysical 
estimation for better forecasts, and explore how 
the updated model and data contribute to reliable 
forecasting. Our case studies confirm that realis- 
tic model structure, correct parameterization, and 
accurate external environmental conditions are 
critical for forecasting carbon dynamics. The fol- 
lowing section describes how the updated model 
and data contribute to reliable forecasting in the 
SPRUCE experiment. For other applications, the 
interested reader is referred to Huang et al. (2019). 

In the SPRUCE project, the system automati- 
cally conducts data assimilation with the new 
observational data streams from each week, suc- 
cessively improving the model parameterization. 
With constantly adjusted model and external 
forcing and weekly archived model parameters, 
model structure, external forcing, and forecast- 
ing results, the contribution of model and data 
updates to forecasting accuracy can be tracked by 
comparing the previous week's forecasted simula- 
tions to the current one informed by data from 
that week. Figure 34.3 illustrates how updated 
external forcing (compared to stochastically gen- 
erated forcing) and shifts in ecosystem state vari- 
ables shape the predictions. “Updated” means the 
real meteorological forcing monitored from the 
site’s weather station. In the absence of observa- 
tions, stochastically generated forcing is used as a 
proxy for future meteorological conditions. Future
precipitation and air temperature are generated
by vector autoregression using a historical dataset
(1961-2014) monitored by the weather station.
Photosynthetically active radiation (PAR),
relative humidity, and wind speed are randomly
sampled from the joint frequency distribution at a
given hour each month. Detailed information on
weather forcing is available in Jiang et al. (2018).
TECO is trained through data assimilation with
observations from 2011-2014 and used to forecast
GPP and total soil organic carbon content at
the beginning of 2015.

Figure 34.3. Updated v. original forecasting of gross primary production (GPP; panels a, c) and soil organic C content (SoilC;
panels b, d). The upper panels show three series of forecasts with updated v. stochastically generated weather forcing. “Updated”
= forced by actual meteorology from field weather stations. Cyan lines indicate forecasting with 100 stochastically generated
weather forcing timeseries from January 2015 to December 2024 (S1); red lines correspond to forecasting updating with measured
weather forcing from January 2015 to July 2016, followed by forecasting with 100 stochastically generated weather forcing
timeseries from August 2016 to December 2024 (S2); blue lines show updated forecasting with measured weather forcing
from January 2015 to December 2016, followed by forecasting with 100 stochastically generated weather forcing timeseries
from January 2017 to December 2024 (S3). Panels (c) and (d) display mismatches between updated forecasting (S2, 3) and
the original forecasting (S1). Red displays the difference between S2 and S1 (S2-S1), and blue shows the discrepancy between
S3 and S1 (S3-S1). Dashed green lines indicate the start of forecasting with stochastically generated weather forcing. Note that
panels (a) and (c) are plotted on a yearly timescale and panels (b) and (d) show results on a monthly timescale. Adapted from
Huang et al. (2019).
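The vector-autoregressive generation of future forcing described above can be illustrated with a minimal sketch. It is not the SPRUCE implementation: the coefficient matrix `A`, the noise levels `NOISE_SD`, and the function names are hypothetical, the variables are treated as anomalies around a seasonal mean, and the fitting step against the 1961-2014 record is omitted.

```python
import random

# Hypothetical VAR(1) coefficients for anomalies of
# [air temperature, precipitation]; in practice these would be
# estimated from the historical weather record.
A = [[0.7, -0.1],
     [0.0, 0.3]]
NOISE_SD = [1.5, 2.0]  # hypothetical innovation standard deviations


def simulate_var1(n_steps, rng, x0=(0.0, 0.0)):
    """Simulate x[t] = A x[t-1] + e[t] for n_steps time steps."""
    x = list(x0)
    path = []
    for _ in range(n_steps):
        eps = [rng.gauss(0.0, sd) for sd in NOISE_SD]
        # The right-hand side uses the previous state for both variables.
        x = [A[i][0] * x[0] + A[i][1] * x[1] + eps[i] for i in range(2)]
        path.append(tuple(x))
    return path


def generate_ensemble(n_members, n_steps, seed=0):
    """Generate an ensemble of stochastic forcing trajectories."""
    rng = random.Random(seed)
    return [simulate_var1(n_steps, rng) for _ in range(n_members)]


# 100 members, 120 monthly steps (ten years), as an illustration only.
ensemble = generate_ensemble(n_members=100, n_steps=120)
```

Each ensemble member is one plausible future trajectory; running the model once per member is what turns forcing uncertainty into forecast uncertainty.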

For demonstration purposes, Figure 34.3 shows
three series of forecasting results instead of updates
from every week. Series 1 (S1) records forecasted
gross primary production (GPP) and soil carbon
with stochastically generated weather forcing from
January 2015 to December 2024 (Figure 34.3a, b,
cyan). Series 2 (S2) records simulated GPP and soil
carbon with observed (updated) climate forcing
from January 2015 to July 2016 and forecasted GPP
and soil carbon with stochastically generated forcing
from August 2016 to December 2024 (Figure
34.3a, b, red). Similarly, the stochastically generated
forcing in Series 3 (S3) starts from January
2017 (Figure 34.3a, b, blue). For each series, predictions
were conducted with randomly sampled
parameters from the posterior distributions and
stochastically generated forcing. For each of the 100
stochastically generated forcing timeseries, the mean
across an ensemble of forecasts with different
parameters is displayed, giving 100 mean trajectories
per series.

GPP is highly sensitive to climate forcing. The
differences between the updated (S2, 3) and initial
forecasts (S1) reach almost 800 gC m⁻² yr⁻¹
(Figure 34.3c). The discrepancy is strongly dampened
in the following 1-2 years. The impact of
updated forecasts is close to zero after approximately
five years. However, the soil carbon pool
shows a different pattern. The soil carbon pool is
increased by less than 150 gC m⁻², which is relatively
small compared to the carbon pool size of
ca. 62,000 gC m⁻². For soil, the impact of updated
forecasts grows with time and is highest at the end
of the simulation in 2024. GPP is sensitive to
the immediate change in climate forcing, while
the updated ecosystem state (or initial value) has
a minimal impact on the long-term forecast of
GPP. The impact of updated climate forcing is relatively
small for soil carbon forecasts during our
study period. Soil carbon is less sensitive to the
immediate change in climate compared to GPP.
However, the alteration of system status affects the
soil carbon forecast, especially on a longer timescale.
Since we are archiving updated forecasts
every week, we can track the relative contribu- 
tion of ecosystem status, forcing uncertainty, and 
parameter distributions to the overall forecasting 
patterns of different ecological variables and how 
these patterns evolve in time. In addition, as more 
observations of ecological variables (e.g., carbon 
fluxes and pool sizes) become available, it is fea- 
sible to diagnose key factors that promote robust 
forecasts by comparing the archived forecasts to 
observations and these to the model parameters, 
initial values, and climate forcing used. 

In addition to scientific capability, the EcoPAD 
system brings new opportunities to broaden 
user-model interactions and facilitate forecasting
practice. The high complexity and long learning
curve of ecological models and data-model fusion
techniques frequently discourage researchers and 
impede progress in forecasting practice. EcoPAD 
is designed to reduce these hurdles. It can be 
accessed from a web browser and does not require 
any coding by the user, which opens the door for 
non-modelers to work with models. The online
storage of results lowers the risk of data loss. The
results of each model run can be easily tracked 
and shared with a unique ID and web address. In 
addition, the web-based workflow saves time for 
experts through automated model running, data 
assimilation, forecasting, structured result access, 
and instantaneous graphic outputs, allowing the 
researcher to focus on a thorough exploration of 
results. At the same time, advanced users have the 
flexibility to scrutinize or modify model code, 
embed a different ecological model, change the 
data assimilation algorithm or add new forecast- 
ing properties. 

In summary, EcoPAD provides an effective
infrastructure: an interactive platform that
rigorously integrates models, observations,
statistical advances, information technology,
and human resources, from experimentalists and
modelers to practitioners and the general public.
This facilitates progress in ecological forecasting
and analysis. That being said, ecological forecasting
and the EcoPAD platform are both at an early
stage of development. We need more creative ideas
and community efforts to realize the potential of
this promising field.


SUGGESTED READING


QUIZZES


1. What is ecological forecasting? 


2. What key challenges and barriers does EcoPAD 
help overcome? 


3. What are the four major components of EcoPAD? 


4. List four potential applications of EcoPAD.




CHAPTER THIRTY-FIVE 


Practice 9 


ECOLOGICAL FORECASTING AT THE 
SPRUCE SITE 


Jiang Jiang 


Nanjing Forestry University, Nanjing, China 


CONTENTS 


Introduction
Dataset Preparation for EcoPAD
Accessing and Working with EcoPAD-SPRUCE


This practice aims to help readers gain familiar- 
ity with ecological forecasting by using EcoPAD 
to perform ecological forecasting at the SPRUCE 
experiment. The web portal of EcoPAD-SPRUCE 
provides automated ecological forecasting at a 
weekly time scale. From the web portal, users can 
check current near- and long-term forecasting 
results, conduct model simulations, data assimila- 
tion, and forecasting runs, and analyze and visual- 
ize model results. There are three exercises in this 
practice, using either CarboTrain or EcoPAD. We 
delve into how constrained posterior parameters 
influence forecast uncertainty; how different forcing
influences forecast uncertainty; and how ecosystem
forecasts respond to warming and elevated
CO₂ with fully specified uncertainties.


INTRODUCTION 


To generate a realistic projection of terrestrial car- 
bon dynamics, we need to have three elements per- 
fectly aligned (Chapter 33). First, we need a good 
model structure, which represents underlying eco- 
system processes. Then, we need a good parameter- 
ization method, such as data assimilation discussed 
in units 6-8, to estimate parameter values. Finally, 
we also need a workflow system to link real-time 
forcings to the model (Chapter 34). In this practice,
we will focus on the third element, using EcoPAD to
link forcing data to carbon cycle models that usually
need climatic forcing to drive the projections.


DOI: 10.1201/9780429155659-44

The Ecological Platform for Assimilating Data 
into model (EcoPAD) provides this functional- 
ity. EcoPAD is described in detail in Chapter 34. 
Traditionally, modelers tune a model, validate
it with data, and then generate model
output as a prediction. This is called forward modeling.
In addition to forward modeling, EcoPAD can
perform data assimilation and ecological forecast- 
ing. EcoPAD links data and model via data assimila- 
tion to optimize parameter estimation. The system 
can update parameter estimations when new data 
becomes available to generate real-time forecast- 
ing. From an ensemble of parameters, EcoPAD can 
also produce an ensemble of forecasts, instead 
of just one projection. This method allows us to 
quantify the uncertainty of forecasting. The system 
can provide feedback and link the experimental 
and modeling communities, informing experi- 
mentalists on which data sets are needed to fur- 
ther improve model predictions; and modelers on 
which parts of a model are responsible for inac- 
curate or poorly constrained predictions, and may 
need to be improved. 
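To make the step from an ensemble of parameters to an ensemble of forecasts concrete, here is a minimal sketch. It does not use TECO: the one-pool model, the carbon input, and the uniform "posterior" sample of the turnover time `tau` are all hypothetical stand-ins chosen only to show how a forecast band arises from parameter uncertainty.

```python
import random
import statistics


def forecast_pool(c0, carbon_input, tau, n_years):
    """Forward run of dC/dt = input - C / tau with a yearly Euler step."""
    c = c0
    out = []
    for _ in range(n_years):
        c = c + carbon_input - c / tau
        out.append(c)
    return out


def summarize(ensemble):
    """Per-year mean and 5th/95th percentile values across members."""
    summary = []
    for year_values in zip(*ensemble):
        ordered = sorted(year_values)
        n = len(ordered)
        low = ordered[int(0.05 * (n - 1))]
        high = ordered[int(0.95 * (n - 1))]
        summary.append((statistics.mean(ordered), low, high))
    return summary


rng = random.Random(42)
# Hypothetical posterior sample of the turnover time (years).
posterior_tau = [rng.uniform(80.0, 120.0) for _ in range(200)]

# One forward run per sampled parameter value gives the forecast ensemble.
ensemble = [
    forecast_pool(c0=62000.0, carbon_input=600.0, tau=tau, n_years=10)
    for tau in posterior_tau
]
summary = summarize(ensemble)
```

A single forward run would give one trajectory only; the spread between the low and high values in `summary` is what quantifies forecast uncertainty, and in this toy setup it widens as the forecast horizon grows.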

EcoPAD requires two categories of datasets,
observation data and forcing data, to be able to
generate forecasts. Observation data are used to
optimize the model parameters through data assimilation.
Observations might include vegetation or soil
inventory data, laboratory measurements, or high
temporal resolution sensor data such as ecosystem
flux measurements from the FLUXNET database.
Evaluation of different datasets for their effectiveness
in constraining model parameters is described
in Chapter 29. The other category of datasets is
used as forcing to drive a model. The temporal
resolution of forcing data should be the same as the
time step of the model used for projections, and
the data should cover the same time period as the
historical and future parts of the simulations.

In this practice, EcoPAD is applied to the SPRUCE 
Experimental Forest in Northern Minnesota, USA. 
The SPRUCE experiment is described in detail 
in Chapter 25. SPRUCE is an ongoing project 
that focuses on long-term responses of northern
peatland to climate warming and increased
atmospheric CO₂ concentration. The project generates
a large variety of observational datasets that
reflect ecosystem dynamics on different scales. 
These datasets are available from the project web 
page and file transfer protocol (FTP) site. EcoPAD 
accesses this FTP site directly and automatically to 
download the data. 


DATASET PREPARATION FOR EcoPAD 


In EcoPAD-SPRUCE, observational datasets are 
sourced from SPRUCE archives and stored in the 
EcoPAD metadata catalog for running the TECO 
model and conducting data assimilation. The forc- 
ing datasets to drive TECO in EcoPAD are hourly cli- 
mate data, which can be separated into two parts, 
namely past climate and projection of future cli- 
mate. Climate data are automatically downloaded 
from the SPRUCE FTP site every week. The future 
climate timeseries are generated as an ensemble
using vector autoregressive modeling.
Observational data can be updated whenever new data
become available. In this training version, however,
we use pretreatment datasets from 2011 to 2014 
to investigate how constrained posterior param- 
eters influence forecasting uncertainty, and parti- 
tion the uncertainty sources. 

Pretreatment observations include three data
sets of community-scale flux measurements (gross
primary production, GPP; net ecosystem exchange,
NEE; and ecosystem respiration, Reco) in 1.2 m
internal diameter chambers, six data sets of plant
biomass growth and carbon content (foliage, wood
and root), one data set of carbon in peat soil, and
leaf phenological data. During 2011-2014, CO₂
flux observations were collected monthly during
the growing season at ambient plots of the experimental
site. A total of 30 measurements were
collected in August, September, and October 2011;
May through November 2012; July, September,
and October 2013; and June and July 2014. Three
annual data points from 2012 to 2014 for plant foliage,
woody biomass and aboveground net primary
production (NPP) were estimated from inventory
data. Biomass data were compiled by combining
allometric data for shrubs, all ground layer species,
and trees. Only one data point each was collected
for fine-root and peat soil C. We also collected leaf-out
dates; in the TECO model, leaf-out is calculated from
growing degree-days above a threshold. The standard
deviations reported in these data sets were also compiled
to estimate uncertainties for each data stream.
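The growing degree-day rule mentioned above can be sketched as follows. This is only an illustration of the general technique, not the TECO code: the base temperature of 5 °C, the threshold of 100 degree-days, and the function name are hypothetical.

```python
def leaf_out_day(daily_mean_temp, base_temp=5.0, gdd_threshold=100.0):
    """Return the first day (1-based) on which accumulated growing
    degree-days reach the threshold, or None if they never do."""
    gdd = 0.0
    for day, temp in enumerate(daily_mean_temp, start=1):
        # Only warmth above the base temperature accumulates.
        gdd += max(0.0, temp - base_temp)
        if gdd >= gdd_threshold:
            return day
    return None


# Toy year: 90 cold days followed by 60 days at 10 degrees C.
# Each warm day adds 5 degree-days, so the threshold is met
# on the 20th warm day, i.e. day 110.
temps = [-5.0] * 90 + [10.0] * 60
predicted_leaf_out = leaf_out_day(temps)
```

In data assimilation, a threshold such as `gdd_threshold` is the kind of parameter that observed leaf-out dates can constrain.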

The compiled observational data file SPRUCE_obs.txt
can be found in any EcoPAD simulation
results under the input folder. It is also available
in the GitHub repository
(https://github.com/ou-ecolab/teco_spruce/tree/master/input). The first
column of the data file is “days”, which records 
the day number when data were collected, using 
January 1st, 2011 as the first day. The subsequent 
columns contain data values for each variable in 
the dataset. If the data are not available or missing, 
the value is shown as -9999. 
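A file laid out as described above could be read along the following lines. This is a sketch under assumptions: the whitespace-delimited layout, the header row, the column names other than "days", and the sample values are all invented for illustration, so check them against the actual SPRUCE_obs.txt before use.

```python
import datetime

MISSING = -9999.0
FIRST_DAY = datetime.date(2011, 1, 1)  # day number 1 in the file

# Made-up sample in the assumed layout (not real SPRUCE data).
SAMPLE = """days GPP NEE Reco
213 2.1 -0.6 1.5
244 -9999 -0.2 -9999
"""


def parse_observations(text):
    """Parse rows into dicts, mapping day numbers to dates and
    the -9999 missing-value code to None."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        fields = line.split()
        day = int(fields[0])
        record = {"date": FIRST_DAY + datetime.timedelta(days=day - 1)}
        for name, raw in zip(header[1:], fields[1:]):
            value = float(raw)
            record[name] = None if value == MISSING else value
        records.append(record)
    return records


records = parse_observations(SAMPLE)
```

Converting the missing-value code to `None` early keeps -9999 from being treated as a real observation in later calculations.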

External forcing of the TECO model includes 
hourly climate data of photosynthetically active 
radiation (PAR), air temperature, soil temperature, 
precipitation, relative humidity, vapor pressure 
deficit, and wind speed. We generated an ensemble
of 300 trajectories of ten-year forcing variables
from 2015 to 2024 as inputs for the forecasting
period. The generated climate data are archived
in the EcoPAD metadata catalog. For each operational
forecast, the system updates climate data
streams from the SPRUCE FTP site to replace
stochastically generated forcing.
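The replacement of generated forcing by measured forcing can be pictured with a short sketch. The function name and the toy numbers are hypothetical; the sketch only assumes that observed and generated series share the same time step and start date.

```python
def splice_forcing(observed, generated_member):
    """Use observations where they exist, then continue with the
    stochastically generated series for the remaining time steps."""
    n_obs = len(observed)
    return list(observed) + list(generated_member[n_obs:])


# Two toy ensemble members of generated air temperature (4 time steps).
generated = [
    [10.0, 11.0, 12.0, 13.0],
    [9.0, 9.5, 10.5, 12.5],
]
# Measurements are available for the first two time steps only.
observed = [10.4, 10.9]

updated = [splice_forcing(observed, member) for member in generated]
```

After splicing, all members agree over the observed period and differ only in the part that is still a forecast, which is why each weekly update narrows the forcing uncertainty for the period already measured.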


ACCESSING AND WORKING WITH 
EcoPAD-SPRUCE 


The EcoPAD-SPRUCE web portal, the forecasting system
for SPRUCE, is available at https://ecolab.nau.edu/ecopad_portal/.
From the web portal, users
can check current near- and long-term forecasting
results, conduct model simulations, data
assimilation, and forecasting runs, and analyze and
visualize model results.

Figure 35.1. The Custom Workflow web portal of EcoPAD applied for the SPRUCE project. Users can select among simulation,
data assimilation and forecasting modes from the task drop-down box to run ecological models in the background.

The front page of the EcoPAD-SPRUCE portal
includes animation demos and a brief description
of the system. The animation demos display the
dynamic change of GPP, Reco, foliage carbon (foliage C),
wood carbon (wood C), root carbon (root C)
and soil carbon (soil C) under 10 manipulative
warming and elevated atmospheric CO₂ treatments.
Each animation shows observations in the data
assimilation period during which parameters are
constrained (2011-2014) as well as model results
with uncertainty from data assimilation and ten-year
forecasts from an ensemble of model runs.
Warming generally increases GPP, Reco and carbon
pools. Users can also get a sense of how uncertainties
in the forcing variables that drive carbon fluxes
in terrestrial ecosystems, such as light, temperature,
and precipitation, affect the uncertainty of predictions.

Under the SPRUCE Forecasting menu, forecasting
GPP and Reco is automated at a weekly timescale.
Mean trajectory and confidence interval
are derived from 100 stochastic simulations. In
each simulation, parameters are randomly chosen
from previous (or default) data assimilation
tasks. Data assimilation is run irregularly, whenever
observation data is compiled manually, to
constrain and update the parameters. Users can
choose forecasting results from different warming
and elevated atmospheric CO₂ treatments in a
drop-down box.

Under the Custom Workflow menu, users can
choose between three different modes to run the
TECO model: Simulation, Data Assimilation (DA) and
Forecasting (Figure 35.1). Each task requested by a
user is assigned a unique task ID. Users can check
information such as task ID, timestamp, parameters,
result status, and result URL from a web-enabled
report once the task is submitted under the
Task History tab. If the task status shows “SUCCESS”,
users can check datasets relevant to task outputs
from the result URL. The URL directs users to the
location (result repository) where information
related to model runs is stored.


EXERCISE 1: How do constrained
posterior parameters influence forecasting
uncertainty?


There are two options to do this exercise: using
either CarboTrain or EcoPAD. If you choose to
use CarboTrain, please open the software, and
select unit 9 and Exercise 1. An output folder
should also be specified. Options allow you to
set parameter initial values and ranges for data
assimilation. Click Run Exercise to perform data
assimilation to constrain parameters, followed
by forecasting. Depending on the power of your
computer, it may take a long time to run data
assimilation. You can run the data assimilation
procedure for about ten minutes and then press
‘Ctrl+C’ on your keyboard to produce a set of
parameters that may not be well constrained.
The system will continue to run the forecasting
code and plot results. You can find all the results
in the user-defined output folder.

Alternatively, you may perform the task in
the EcoPAD web version. Go to the webpage at
https://ecolab.nau.edu/ecopad_portal_workshop_v2/,
and follow Ex 1.1, Ex 1.2 and Ex 1.3,
below, to complete all the steps.


Ex 1.1 Running the task of forecasting

a. Click the custom workflow, and choose
Forecasting under the task drop-down box.

b. Specify which parameters to use from
previous DA. The system has preset
default DA parameters.

c. Run the exercise, and wait around ten
minutes to get a “SUCCESS” run.

d. Open the results link, and check the
results in the Output Folder.


QUESTIONS:

1. Do the trajectories of forecasting results
match the real data?

2. What's the difference between forecasting
results among different variables, such as
GPP, Reco, Foliage, Wood, and Soil?


Ex 1.2 Running the task of data assimilation

a. Select Data Assimilation (DA) in the task
drop-down box.

b. Click Set DA initial Parameters. There are 47
parameters. Up to 18 may be chosen to
participate in data assimilation.

c. Click Run model, and wait around ten minutes
to get a “SUCCESS” run.

d. Open the results link, and check the
results in the Output Folder.

e. Repeat the above steps for pool-based
versus flux-based parameters and review
the updated posterior distribution of
parameters.

f. Repeat Ex 1.1 using updated parameters
to show forecasting results.


QUESTION:

What key differences are seen in the forecasting
results following assimilation of pool-based
versus flux-based parameters?


Ex 1.3 Effects of different posterior distribution
of parameter values

a. Repeat Ex 1.2 to perform data assimilation
on any of the 18 parameters.

b. Repeat Ex 1.1, but choose other parameters
by clicking the button Set Data
Assimilation instead of the default set.
If data assimilation has not yet been
conducted (Ex 1.2), the Data Assimilation
Parameter Estimate Selection dialog will be
empty.

c. Check the results in the Output Folder,
and compare the results with outputs
from Ex 1.1.

Figure 35.2 lists some of the previous data
assimilations. The metadata tags show what
parameters were chosen to do data assimilation.
Users can select the constrained parameters
for forecasting. To check detailed results
of previous data assimilation, users can find the
datasets under the Task History tab according to
the Timestamp.

Figure 35.2. Example of outputs from previous data assimilations. [Screenshot of the Data Assimilation Parameter Estimate
Selection dialog, with columns Select, Task Name, Timestamp and Metadata Tags. Each row is a previous TECO Spruce data
assimilation task whose metadata tags list the constrained parameters, for example SLA, GLmax, GRmax, Gsmax, Vcmx0,
Tau_Leaf, Tau_Wood, Tau_Root, Tau_F, Tau_C, Tau_Micro, Tau_SlowSOM, Tau_Passive, gddonset, Q10, Rl0, Rs0, Rr0.]

A unique feature of the data assimilation portal
is that users can pick whatever parameters are
to be constrained among the pool of 18 parameters.
Users can change the initial value and range
of any parameter to be used for data assimilation.

Figure 35.3 shows an example of forecasting
using posterior parameters when all the 18
parameters were chosen for data assimilation.

Figure 35.3. Histogram of the posterior distribution of each parameter from a data assimilation run.


QUESTIONS:

1. Do the forecasting results depend on parameter
posterior distributions?

2. How do the forecasting results change
depending on which parameters were constrained
by data assimilation?


EXERCISE 2: How does different forcing
influence forecasting uncertainty?


External forcing variables such as temperature,
precipitation and radiation regulate various carbon
cycle processes (e.g., plant photosynthesis,
water use, and soil organic matter decomposition),
and therefore influence carbon stocks
in different compartments of an ecosystem. A
challenge for precisely predicting the future
state of an ecosystem is the low predictability
of future trajectories of forcing variables. This
exercise addresses the implications of uncertainty
in forcing for carbon cycle predictions.
There are two options to do this exercise:
using either CarboTrain or EcoPAD. To make the
task simple, CarboTrain provides two sets of
forcing variables, one is fixed forcing; the other
is random forcing chosen from a forcing pool.

a. Select unit 9 and Exercise 2 in the main
window of CarboTrain.

b. Choose fixed forcing under the Forcing Variables
tab, click on Run Exercise.

c. Repeat the above steps but choose random
forcing under the Forcing Variables tab.

If you prefer to do the task in EcoPAD web version,
go to the webpage at http://ecolab.nau.edu/ecopad_portal_workshop_v2.

a. Choose Forecasting with Different Forcing files
under the Task drop-down box.


b. Choose fixed forcing under the Number of
Forcing Sets tab, and run the exercise.

c. Repeat the above steps but choose random
forcing under Number of Forcing Sets tab.


QUESTION:

How do the forecasting results differ when
using different forcing files, comparing fixed
forcing and random forcing?


EXERCISE 3: Uncertainty in forecasting
ecosystem responses to warming and
elevated CO₂


Forecasting is not very informative without
fully specified uncertainties. In the SPRUCE
experiment, forecasting is especially useful if it
can help us to anticipate how long from the
commencement of the experimental treatments
it may take before carbon pool changes may be
expected to be significantly different between
temperature or CO₂ treatments. It takes time
for carbon pool sizes to adjust in response to
different treatments. In general, if the magnitude
of uncertainty in the forecast of a pool is
small, the statistical power to detect treatment
effects on that pool is large, and the time needed
to observe a statistical difference among
treatments is short, and vice versa.

Again, there are two options to do this exercise:
using either CarboTrain or EcoPAD. In
CarboTrain, the steps are:

a. Select unit 9 and Exercise 3 in the main
window of CarboTrain.

b. Click Set DA folder to specify which parameters
to use from previous data assimilation
outputs.

c. Enter the date for Forecast date end. The current
forecasting allows any date up to
2024/12/31.

d. Enter the warming treatment within the
range 0.0°C to 9.0°C, and CO₂ adjustment
within the range 380-900 ppm.

e. Click Run exercise and compare forecasting
simulation results among different
treatments.

If you prefer to do the task in EcoPAD web version,
go to the webpage at http://ecolab.nau.edu/ecopad_portal_workshop_v2.

a. Click the custom workflow, and choose
Forecasting under the task drop-down box.

b. Click Set Data Assimilation to specify which
parameters to use from previous data
assimilation outputs.

c. Under the Forecast Treatment tab, enter the
warming treatment within the range
0.0°C to 9.0°C, and CO₂ adjustment within
the range 380-900 ppm.

d. Run the exercise and compare forecasting
simulation results among different
treatments.
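The link between forecast uncertainty and the time needed to detect a treatment effect can be explored with a small sketch. It is a deliberately simplified stand-in for a proper statistical test: two treatments count as "different" once the central 90% ranges of their forecast ensembles stop overlapping, and the toy ensembles, trend, and spreads are hypothetical.

```python
def central_range(values, lower=0.05, upper=0.95):
    """Approximate central range of an ensemble at one time point."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(lower * (n - 1))], ordered[int(upper * (n - 1))]


def first_separation(control, treatment):
    """Index of the first time step at which the two central ranges no
    longer overlap, or None. Inputs are lists of ensemble members, each
    member being a list of values over time."""
    steps = zip(zip(*control), zip(*treatment))
    for t, (c_vals, t_vals) in enumerate(steps):
        c_lo, c_hi = central_range(c_vals)
        t_lo, t_hi = central_range(t_vals)
        if t_lo > c_hi or c_lo > t_hi:
            return t
    return None


def toy_ensemble(trend, spread, n_members=101, n_years=10):
    """Members share a linear trend and differ by evenly spaced offsets
    between -spread and +spread (a stand-in for forecast uncertainty)."""
    offsets = [spread * (2 * i / (n_members - 1) - 1) for i in range(n_members)]
    return [[trend * year + off for year in range(n_years)] for off in offsets]


# Same treatment effect (10 units per year), small versus large uncertainty.
narrow = first_separation(toy_ensemble(0.0, 10.0), toy_ensemble(10.0, 10.0))
wide = first_separation(toy_ensemble(0.0, 30.0), toy_ensemble(10.0, 30.0))
```

With the same treatment effect, the ensemble with the smaller spread separates earlier (`narrow` is smaller than `wide`), which mirrors the statement that small forecast uncertainty shortens the time to detection.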


QUESTIONS:

1. How do the ecological carbon dynamics
respond to warming and elevated CO₂?

2. How do the uncertainties of the forecast vary
with treatments?


In this chapter we introduce basic concepts and
algorithms from machine learning. We explain 
how neural networks can be used for regression 
and classification problems, and how cross-valida- 
tion can be used for training and testing machine 
learning algorithms. 


INTRODUCTION AND APPLICATIONS OF 
MACHINE LEARNING 


Machine learning is the domain of computer sci- 
ence which is concerned with efficient algorithms 
for making predictions in all kinds of big data sets. 
A defining characteristic of supervised machine 
learning algorithms is that they require a data set 
for training. The machine learning algorithm then 
memorizes the patterns present in those training 
data, with the goal of accurately predicting simi- 
lar patterns in new test data. Many machine learn- 
ing algorithms are domain-agnostic, which means 
they have been shown to provide highly accu- 
rate predictions in a wide variety of application 
domains (computer vision, speech recognition, 
automatic translation, biology, medicine, climate 
science, chemistry, geology, etc.). 

For example, consider the problem of image
classification from the application domain of
computer vision. In this problem, we would like
a function that can input an image, and output an
integer which indicates class membership. More
precisely, let us consider the MNIST and Fashion-MNIST
data sets (Figure 36.1), in which each input
is a grayscale image with height and width of 28
pixels, represented as a matrix of real numbers
$x \in \mathbb{R}^{28 \times 28}$ (LeCun et al., 1998; Xiao et al., 2017).
In both the MNIST and Fashion-MNIST data sets
each image has a corresponding label which is
an integer $y \in \{0, 1, \ldots, 9\}$. In the MNIST data
set each image/label represents a digit, whereas
in Fashion-MNIST each image/label represents
a category of clothing (0 for T-shirt/top, 1 for
Trouser, 2 for Pullover, etc). In both data sets the
goal is to learn a function $f: \mathbb{R}^{28 \times 28} \to \{0, 1, \ldots, 9\}$
which inputs an image x and outputs a predicted
class f(x) which should ideally be the same as the
corresponding label y.

Figure 36.1. A learning algorithm inputs a train data set, and
outputs a prediction function, g or h. Both g and h input a
grayscale image and output a class (integer from 0 to 9), but
g is for digits and h is for fashion. [Diagram columns: Learning
Algorithm, Train data, Learned function, Predictions on test data.]

As mentioned above, a big advantage of super- 
vised learning algorithms is that they are typically 
domain-agnostic, meaning that they can learn 
accurate prediction functions f using data sets with 
different kinds of patterns. That means we can use 
a single learning algorithm LEARN on either the 
MNIST or Fashion-MNIST data sets (Figure 36.1, 
left). For the MNIST data set the learning algorithm
will output a function for predicting the
class of digit images, and for Fashion-MNIST the
learning algorithm will output a function for predicting
the class of a clothing image (Figure 36.1,
right). The advantage of this supervised machine 
learning approach to image classification is that 
the programmer does not need any domain-spe- 
cific knowledge about the expected pattern (e.g., 
shape of each digit, appearance of each cloth- 
ing type). Instead, we assume there is a data set 
with enough labels for the learning algorithm to 
accurately infer the domain-specific pattern and 
prediction function. This means that the machine 
learning approach is only appropriate when it is 
possible/inexpensive to create a large, labeled data 
set that accurately represents the pattern/function 
to be learned. 
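The idea that one learning algorithm can serve two domains can be shown in a few lines. The sketch below swaps in a 1-nearest-neighbor rule as the learning algorithm because it is short enough to read in full; it is not the neural network method of this chapter, and the two-feature toy inputs are hypothetical stand-ins for images.

```python
def learn(train_inputs, train_labels):
    """Domain-agnostic learning algorithm: returns a prediction function
    fitted to whatever train set it is given (1-nearest neighbor)."""
    def predict(x):
        def squared_distance(other):
            return sum((a - b) ** 2 for a, b in zip(x, other))
        nearest = min(
            range(len(train_inputs)),
            key=lambda i: squared_distance(train_inputs[i]),
        )
        return train_labels[nearest]
    return predict


# Two toy "domains" with different patterns and label meanings.
digit_inputs = [(0.0, 0.1), (0.9, 1.0)]
digit_labels = [0, 1]
fashion_inputs = [(5.0, 5.0), (-5.0, -5.0)]
fashion_labels = [2, 7]

# The same LEARN procedure yields two different prediction functions.
g = learn(digit_inputs, digit_labels)
h = learn(fashion_inputs, fashion_labels)
```

Nothing inside `learn` refers to digits or clothing; all domain knowledge enters through the labeled train data, which is why enough accurate labels are a precondition for this approach.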

How do we know if the learning algorithm is 
working properly? The goal of supervised learn- 
ing is generalization, which means the learned 
prediction function f should accurately predict 
f(x) = y for any inputs/outputs (x, y) that will be
seen in a desired application (including new data
that were not seen during learning). To formalize 
this idea, and to compute quantitative evaluation 
metrics (accuracy/error rates), we need a test data 
set, as explained in the next section. 


K-fold Cross-Validation For Evaluating Prediction/ 
Test Accuracy 


Each input x in a data set is typically represented as 
one of N rows in a “design matrix” with D columns 
(one for each dimension or feature). Each output 
y is represented as an element of a label vector of 
size N, which can be visualized as another column 
alongside the design matrix (Figure 36.2, left). For 
example, in the image data sets discussed above we 
have N = 60,000 labeled images/rows, each with
D = 784 dimensions/features (one for each of the
28 × 28 pixels in the image).
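As a small illustration of the design matrix layout, the sketch below flattens 28 × 28 images into rows of D = 784 features. The three blank toy images and the row-by-row flattening order are assumptions made for the example; real data sets would supply the pixel values.

```python
def flatten_image(image):
    """Turn a 28 x 28 nested list into one row of 784 features,
    reading the pixels row by row."""
    return [pixel for row in image for pixel in row]


# Three toy images (all black), so N = 3 instead of 60,000.
images = [[[0.0] * 28 for _ in range(28)] for _ in range(3)]
images[1][2][3] = 1.0  # one bright pixel at row 2, column 3 of image 1

design_matrix = [flatten_image(image) for image in images]
labels = [0, 7, 4]  # label vector of size N (made-up classes)

n_rows = len(design_matrix)         # N
n_features = len(design_matrix[0])  # D
```

With this ordering, the pixel at row r and column c ends up in feature column `r * 28 + c`, so the bright pixel above sits in column 59 of row 1.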

The goal of supervised learning is to find a 
prediction function f such that f(x) = y for all
inputs/outputs (x, y) in a test data set (which is 
not available for learning f). So how do we learn f 
for accurate prediction on a test data set, if that test 
set is not available? We must assume that we have 
access to a train data set with the same statistical 
distribution as the test data. The train data set is 
used to learn f, and the test data can only be used 
for evaluating the prediction accuracy/error of f. 

Some benchmark data sets which are used
for machine learning research, like MNIST and
Fashion-MNIST, have designated train/test sets.
However, in most applications of machine learning
to real data sets, train/test sets must be created.
One approach is to create a single train/test
split by randomly assigning a set to each of the
N rows/observations, say 50% train rows and 50%
test rows. The advantage of that approach is simplicity,
but the drawback is that we can only report
accuracy/error metrics with respect to one test
set (e.g., the algorithm learned a function which
accurately predicted 91.3% of observations/labels
in the test set, meaning 8.7% error rate).

Figure 36.2. K = 3 fold cross-validation. Left: the first step is to randomly assign a fold ID from 1 to K to each of the
observations/rows. Right: in each of the $k \in \{1, \ldots, K\}$ splits, the observations with fold ID k are set aside as a test set, and the
other observations are used as a train set to learn a prediction function (f1-f3), which is used to predict for the test set, and to
compute accuracy metrics (A1-A3).

In addition to estimating the accuracy/error 
rate, it is important to have some estimate of vari- 
ance in order to make statements about whether 
the prediction accuracy/error of the learned func- 
tion f is significantly larger/smaller than other 
prediction functions. The other functions to com- 
pare against may be from other supervised learn- 
ing algorithms, or some other method that does 
not use machine learning (e.g., a domain-specific 
physical/mechanistic model). A common baseline
is the constant function f(x) = y₀, where y₀ is the
average or most frequent label in the train data.
This baseline ignores all of the inputs/features 
x, and can be used to show that the algorithm is 
learning some non-trivial predictive relationship 
between inputs and outputs (for an example see 
Figure 36.4). 
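The constant baseline described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the toy labels are assumptions made here, not part of the chapter's analysis.

```python
from collections import Counter
from statistics import mean

def baseline_regression(train_labels):
    """Constant prediction function: the mean of the train labels."""
    constant = mean(train_labels)
    return lambda x: constant  # the input x is ignored on purpose

def baseline_classification(train_labels):
    """Constant prediction function: the most frequent train label."""
    constant = Counter(train_labels).most_common(1)[0][0]
    return lambda x: constant  # the input x is ignored on purpose

def accuracy(predict, inputs, labels):
    """Fraction of observations whose predicted label equals the true label."""
    hits = sum(predict(x) == y for x, y in zip(inputs, labels))
    return hits / len(labels)

# Hypothetical toy labels, for illustration only.
train_y = ["cat", "dog", "cat", "cat", "bird"]
test_x = [[0.1], [0.7], [0.3], [0.9]]
test_y = ["cat", "dog", "cat", "bird"]

f0 = baseline_classification(train_y)
print(accuracy(f0, test_x, test_y))  # 0.5
```

Any learned function worth reporting should beat this number on the same test set.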

The K-fold cross-validation procedure generates 
K splits, and can therefore be used to estimate both 
mean and variance of prediction accuracy/error.
The number of folds/splits K is a user-defined
integer parameter which must be at least 2, and
at most N. Typical choices range from K = 3 to 10,
and usually the value of K does not have a large 
effect on the final estimated mean/variance of 
prediction accuracy/error. The algorithm begins 
by randomly assigning a fold ID number (integer 
from 1 to K) to each observation (Figure 36.2, 
left). Then for each unique fold value from 1 to K, 
we hold out the corresponding observations/rows 
as a test set, and use data from all other folds as 
a train set (Figure 36.2, right). Each train set is 
used to learn a corresponding prediction function, 
which is then used to predict on the held-out test 
data. Finally, accuracy/error metrics are computed
in order to quantify how well the predictions fit 
the labels for the test data. Overall, for each data set 
and learning algorithm the K-fold cross-validation 
procedure results in K splits, K learned functions, 
and K test accuracy/error metrics, which are typi- 
cally combined by taking the mean and standard 
deviation (or median and quartiles). Other algo- 
rithms may be used with the same fold assign- 
ments, in order to compare algorithms in terms of 
accuracy/error rates in particular data sets. 
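The K-fold procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names (`assign_folds`, `k_fold_cv`), the most-frequent-label learner, and the toy data are all hypothetical, and any learning algorithm with the same `learn(train_x, train_y)` interface could be substituted.

```python
import random
from collections import Counter
from statistics import mean, stdev

def assign_folds(n_obs, n_folds, seed=0):
    """Randomly assign a fold ID in 1..n_folds to each observation (near-equal sizes)."""
    fold_ids = [(i % n_folds) + 1 for i in range(n_obs)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids

def k_fold_cv(inputs, labels, fold_ids, learn, metric):
    """Return one test metric per fold; learn(train_x, train_y) returns a function f."""
    results = []
    for k in sorted(set(fold_ids)):
        train_x = [x for x, fid in zip(inputs, fold_ids) if fid != k]
        train_y = [y for y, fid in zip(labels, fold_ids) if fid != k]
        test_x = [x for x, fid in zip(inputs, fold_ids) if fid == k]
        test_y = [y for y, fid in zip(labels, fold_ids) if fid == k]
        f_k = learn(train_x, train_y)                 # learn on the train set only
        results.append(metric(f_k, test_x, test_y))   # evaluate on the held-out fold
    return results

def learn_most_frequent(train_x, train_y):
    """Stand-in learning algorithm: the constant baseline."""
    constant = Counter(train_y).most_common(1)[0][0]
    return lambda x: constant

def accuracy(f, xs, ys):
    return sum(f(x) == y for x, y in zip(xs, ys)) / len(ys)

# Hypothetical toy data: 12 observations, one feature, binary labels.
inputs = [[float(i)] for i in range(12)]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
fold_ids = assign_folds(len(inputs), n_folds=3)
accs = k_fold_cv(inputs, labels, fold_ids, learn_most_frequent, accuracy)
print(mean(accs), stdev(accs))
```

Passing the same `fold_ids` to several `learn` functions gives the paired, per-fold comparison between algorithms that the text describes.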


For example, Figure 36.4 uses K = 4-fold cross- 
validation to compare four learned functions on an 
image classification problem. The accuracy rates of 
the “dense” and “linear” functions, 97.4 ± 1.6%
and 96.3 ± 1.9% (mean ± standard deviation) are
not significantly different. Both rates are
significantly larger than the accuracy of the
“baseline” constant function, 16.4 ± 1.4%, and smaller
than the accuracy of the “conv” function, 99.3 ± 1.1%.
We can therefore conclude that the most accurate 
learning algorithm for this problem, among these 
four candidates, is the “conv” method (which uses 
a convolutional neural network, explained later). It 
is important to note that statements about which 
algorithm is most accurate can only be made for a 
particular data set, after having performed K-fold 
cross-validation to estimate prediction accuracy/ 
error rates. 


OTHER APPLICATIONS 


So far we have only discussed machine learning 
algorithms in the context of a single prediction 
problem, image classification. In this section we 
briefly discuss other applications of machine 
learning. In each application the set of possible 
inputs x and outputs y are different, but machine 
learning algorithms can always be used to learn a 
prediction function f(x) = y. Jones et al. (2009)
proposed to use interactive machine learning 
for cell image classification in the CellProfiler 
Analyst system. This application is similar to the 
previously discussed digit/fashion classification 
problem, but with only two classes (binary
classification). In this context the input is a
multi-color image of a cell,
$x \in \mathbb{R}^{h \times w \times c}$, where h, w are the
height and width of the image in pixels, and
c = 3 is the number of channels used to represent
a color image (red, green, blue). The output
$y \in \{0, 1\}$ is a binary label which indicates
whether or not the image contains the cell
phenotype of interest.

Some email programs use machine learning 
for spam filtering, which is another example of 
a binary classification problem. When you click 
the “spam” button in the email program you are 
labeling that email as spam (y = 1), and when you 
respond to an email you are labeling that email as 
not spam (y = 0). The input x is an email message, 
which can be represented using a “bag-of-words” 
vector (each element is the number of times a spe- 
cific word occurs in that email message). 
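As an illustration of the bag-of-words representation, the following Python sketch turns a message into a count vector. The four-word vocabulary and the example message are hypothetical; a real spam filter would use a far larger vocabulary and more careful tokenization.

```python
from collections import Counter

def bag_of_words(message, vocabulary):
    """Return one count per vocabulary word (lower-cased, split on whitespace)."""
    counts = Counter(message.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and message, for illustration only.
vocabulary = ["free", "meeting", "money", "tomorrow"]
x = bag_of_words("Free money free prizes", vocabulary)
print(x)  # [2, 0, 1, 0]
```

Words outside the vocabulary ("prizes" here) are simply dropped, so every message maps to a vector of the same length D.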

Russell et al. (2008) proposed the LabelMe
tool for creating data sets for image segmentation,
which is more complex than the previously
discussed image classification problems. In this
context the input
$x \in \mathbb{R}^{h \times w \times c}$ is typically a
multi-color image, and the output
$y \in \{0, 1\}^{h \times w}$ is a binary
mask (one element for every pixel in the image)
indicating whether or not that pixel contains an
object of interest.

Machine learning can be used for automatic 
translation between languages. In this context the 
input is a text in one language (e.g., French) and 
the output is the text translated to another lan- 
guage (e.g., English). The desired prediction func- 
tion f inputs a French text and outputs the English 
translation. 

Machine learning can be used for medical diag- 
nosis. For example, Poplin et al. (2017) showed 
that retinal photographs can be used to predict 
blood pressure or risk of heart attack. Since the 
output y is a real number (e.g., blood pressure of 
120 mm mercury), we refer to this as a regression 
problem. 


AVOIDING UNDER/OVERFITTING IN A NEURAL 
NETWORK FOR REGRESSION 


In this section we begin by explaining the predic- 
tion function and learning algorithm for a simple 
neural network. We then demonstrate how the 
number of iterations of the learning algorithm can 
be selected using a validation set, in order to avoid 
underfitting and overfitting. 

We consider a simple regression problem for
which the input $x \in \mathbb{R}$ is a single real
number (D = 1 feature/column in the design matrix),
and the output $y \in \mathbb{R}$ is as well. Using a
neural network with a single hidden layer of U
units, there are two unknown parameter vectors
which need to be learned using the training data,
$w \in \mathbb{R}^U$ and $v \in \mathbb{R}^U$.
The prediction function f is then defined as:


$$f(x) = w^\top \sigma(xv) = w^\top z, \tag{36.1}$$


where $\sigma : \mathbb{R}^U \to \mathbb{R}^U$ is a non-linear
activation function (applied element-wise), and
$z = \sigma(xv) \in \mathbb{R}^U$ is the vector of hidden
units. Typical activation functions include the
logistic sigmoid $\sigma(t) = 1/(1 + \exp(-t))$ and
the rectifier (or rectified linear units, ReLU)
$\sigma(t) = \max(0, t)$. The prediction function is learned
using gradient descent, which is an algorithm 


that attempts to find parameters w, v which 
minimize the mean squared error between the 
predictions and the corresponding labels in the 
N train data: 


$$L(w, v) = \frac{1}{N} \sum_{i=1}^{N} \left[ w^\top \sigma(x_i v) - y_i \right]^2 \tag{36.2}$$


Gradient descent begins using uninformative
parameters $w_0, v_0$ (typically random numbers
close to zero), then at each iteration
$t \in \{1, \dots, T\}$ the parameters are improved by
taking a step of size $\alpha > 0$ in the negative
gradient direction,


$$w_t = w_{t-1} - \alpha \nabla_w L(w_{t-1}, v_{t-1}) \tag{36.3}$$


$$v_t = v_{t-1} - \alpha \nabla_v L(w_{t-1}, v_{t-1}) \tag{36.4}$$
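The prediction function, the mean squared error loss, and the gradient descent updates can be written out in plain Python as follows. This is a sketch under stated assumptions rather than the implementation used in the chapter (which relies on the R nnet package): it uses the logistic sigmoid, no bias terms (matching the formula for f above), hand-derived gradients, and a hypothetical toy data set, initialization range, and step size.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(x, w, v):
    """f(x) = w^T sigma(x v) for a scalar input x."""
    return sum(w_u * sigmoid(x * v_u) for w_u, v_u in zip(w, v))

def mse(xs, ys, w, v):
    """Mean squared error between predictions and labels."""
    return sum((predict(x, w, v) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(xs, ys, w, v):
    """Gradients of the mean squared error with respect to w and v."""
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_v = [0.0] * len(v)
    for x, y in zip(xs, ys):
        residual = predict(x, w, v) - y
        for u in range(len(w)):
            z_u = sigmoid(x * v[u])
            grad_w[u] += 2.0 * residual * z_u / n
            # chain rule: the derivative of the sigmoid is z * (1 - z)
            grad_v[u] += 2.0 * residual * w[u] * z_u * (1.0 - z_u) * x / n
    return grad_w, grad_v

def gradient_descent(xs, ys, n_hidden, n_iterations, step_size, seed=0):
    """Full gradient descent; returns the parameters and the loss after each iteration."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]  # random numbers close to zero
    v = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    losses = [mse(xs, ys, w, v)]
    for _ in range(n_iterations):
        # both gradients are evaluated at the previous parameters (w_{t-1}, v_{t-1})
        grad_w, grad_v = gradients(xs, ys, w, v)
        w = [w_u - step_size * g for w_u, g in zip(w, grad_w)]
        v = [v_u - step_size * g for v_u, g in zip(v, grad_v)]
        losses.append(mse(xs, ys, w, v))
    return w, v, losses

# Hypothetical toy train set: a noiseless sine pattern on [0, 3].
xs = [i / 10 for i in range(31)]
ys = [math.sin(x) for x in xs]
w, v, losses = gradient_descent(xs, ys, n_hidden=5, n_iterations=200, step_size=0.05)
print(losses[0], losses[-1])
```

Because each step uses all N samples, every pass through the loop is one iteration of full gradient descent.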


The algorithm described above is referred to 
as “full gradient” because the gradient descent 
direction is defined using the full set of N samples 
in the train set. Other common variants include 
“stochastic gradient” (gradient uses one sample) 
and “minibatch” (gradient uses several samples). 
When doing gradient descent on a neural network 
model, one “epoch” includes computing gradients 
once for each sample (e.g., 1 epoch = 1 iteration
of full gradient, 1 epoch = N iterations of
stochastic gradient).
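The relationship between iterations and epochs can be checked with a short calculation. In this Python sketch, N = 60,000 is the train set size quoted earlier for the image data, while the minibatch size of 32 is an arbitrary assumption chosen for illustration.

```python
import math

def iterations_per_epoch(n_train, batch_size):
    """Number of gradient steps needed to visit every train sample once."""
    return math.ceil(n_train / batch_size)

n_train = 60000
print(iterations_per_epoch(n_train, batch_size=n_train))  # full gradient: 1
print(iterations_per_epoch(n_train, batch_size=1))        # stochastic gradient: 60000
print(iterations_per_epoch(n_train, batch_size=32))       # minibatch: 1875
```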

In the algorithm above, the number of hidden 
units U, the number of iterations T, and the step 
size $\alpha$ must be fixed before running the
learning algorithm. These hyper-parameters affect
the learning capacity of the neural network. An 
important consideration when using any machine 
learning algorithm is that you most likely need 
to tune the hyper-parameters of the algorithm 
in order to avoid underfitting and overfitting. 
Underfitting occurs when the learned function f
provides accurate predictions for neither the train
data nor the test data. Overfitting occurs when
the learned function f provides accurate
predictions for the train data only (and not for
the test data). Both underfitting and overfitting
need to be avoided, because the goal of any
learning algorithm is to find a prediction
function f which provides accurate predictions on
test data.

How can we select hyper-parameters which
avoid overfitting? Note that the choice of
hyper-parameters such as the number of hidden units
U and iterations T affects the learned function f,
so we cannot use the test data to learn these
hyper-parameters (by assumption, the test data are
not available at train time). Then, how do we know
which hyper-parameters will result in learned
functions which best generalize to new data?

A general method which can be used with any 
learning algorithm is splitting the train set into 
subtrain and validation sets, then using grid search 
over hyper-parameter values. The subtrain set is 
used for parameter learning, and the validation set 
is used for hyper-parameter selection. In detail, we 
first fix a set of hyper-parameters, say U = 50 hid- 
den units and T = 100 iterations. Then the sub- 
train set is used with these hyper-parameters as 
input to the learning algorithm, which outputs the 
learned parameter vectors w, v. Finally, the learned 
parameters are used to compute predictions f(x)
for all inputs x in the validation set, and the cor- 
responding labels y are used to evaluate the accu- 
racy/error of those predictions. The procedure is 
then repeated for another hyper-parameter set, say 
U = 10 hidden units with T = 500 iterations. In 
the end we select the hyper-parameter set with 
minimal validation error, and then retrain using 
the learning algorithm on the full train set with 
those hyper-parameters. A variant of this method
is to use K-fold cross-validation to generate K
subtrain/validation splits, then compute mean
validation error over the K splits, which typically
yields hyper-parameters that result in more
accurate/generalizable predictions (when compared to
hyper-parameters selected using a single
subtrain/validation split). Note that this K-fold
cross-validation for hyper-parameter learning is
essentially the same procedure as shown in Figure
36.2, but we split the train set into
subtrain/validation sets (instead of splitting all
data into train/test sets as shown in the figure).
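The grid search over a single subtrain/validation split can be sketched as below. Since the method applies to any learning algorithm, this Python illustration swaps in a k-nearest-neighbors regressor as a simple stand-in for the neural network, with the number of neighbors as the only hyper-parameter. All function names, the simulated data, and the grid values are assumptions made for this sketch.

```python
import math
import random

def knn_learn(train_x, train_y, n_neighbors):
    """Stand-in learner: predict the mean label of the n_neighbors nearest train inputs."""
    pairs = list(zip(train_x, train_y))

    def f(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:n_neighbors]
        return sum(y for _, y in nearest) / len(nearest)

    return f

def mean_squared_error(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

def grid_search(train_x, train_y, grid, validation_fraction=0.2, seed=0):
    """Select the hyper-parameter with minimal validation error, then retrain on all train data."""
    indices = list(range(len(train_x)))
    random.Random(seed).shuffle(indices)
    n_validation = max(1, int(len(indices) * validation_fraction))
    validation, subtrain = indices[:n_validation], indices[n_validation:]
    sub_x = [train_x[i] for i in subtrain]
    sub_y = [train_y[i] for i in subtrain]
    val_x = [train_x[i] for i in validation]
    val_y = [train_y[i] for i in validation]
    errors = {}
    for value in grid:
        f = knn_learn(sub_x, sub_y, value)                   # parameters: subtrain set only
        errors[value] = mean_squared_error(f, val_x, val_y)  # selection: validation set only
    best = min(errors, key=errors.get)
    final_f = knn_learn(train_x, train_y, best)              # retrain on the full train set
    return best, errors, final_f

# Hypothetical simulated train data with a noisy sine wave pattern.
rng = random.Random(1)
train_x = [rng.uniform(0.0, 6.0) for _ in range(100)]
train_y = [math.sin(x) + rng.gauss(0.0, 0.3) for x in train_x]
best, errors, f = grid_search(train_x, train_y, grid=[1, 5, 20, 80])
print(best, errors[best])
```

The test set never appears in this function. It is reserved for evaluating `final_f` afterwards.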

For example, we simulated some data with a 
sine wave pattern (Figure 36.3), and used the R 
package nnet to fit a neural network with one 
hidden layer of U = 50 units (Venables and Ripley, 
2013). We demonstrate the effects of under/ 
overfitting by varying the number of iterations/ 
epochs from T = 1 to 1000. In this example 
K = 4-fold cross-validation was used, so each 
data point was randomly assigned a fold ID inte- 
ger from 1 to 4. The result for only the first split 
is shown, so observations assigned fold ID=1 are 
considered the validation set, and other observa- 
tions (folds 2-4) are considered the subtrain set 
(which is used as input to the nnet R function
which implements the gradient descent learning 
algorithm). We then used the predict function 
in R to compute predictions for subtrain and vali- 
dation data, and analyzed how the prediction error 
changes as a function of the number of iterations/ 
epochs T of gradient descent. The data exhibit a
nonlinear sine wave pattern, but the learned
function for T = 4 iterations/epochs is mostly linear
(underfitting, large error on both
subtrain/validation sets). For T = 512
iterations/epochs the learned function is highly
non-linear (overfitting, small error for the
subtrain set but large error for the validation
set). When the error rates are plotted as a function
of a model complexity hyper-parameter such as T
(Figure 36.3, right), we see the characteristic U
shape for the validation error, and the
monotonically decreasing subtrain error. The
hyper-parameter with minimal validation error is
T = 64 iterations/epochs; smaller T values underfit
or are overly regularized, and larger T values
overfit or are under-regularized.

Figure 36.3. Illustration of underfitting and overfitting in a neural network regression model (single hidden layer, 50 hidden units). Left: noisy data with a nonlinear sine wave pattern (grey circles), learned functions (colored curves), and residuals/errors (black line segments) are shown for three values of epochs (panels from left to right) and two data subsets (panels from top to bottom). Right: in each epoch the model parameters are updated using gradient descent with respect to the subtrain loss, which decreases with more epochs. The optimal/minimum loss with respect to the validation set occurs at 64 epochs, indicating underfitting for smaller epochs (green function, too regular/linear for both subtrain/validation sets) and overfitting for larger epochs (purple function, very irregular/nonlinear so good fit for subtrain but not validation set).
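Selecting the number of epochs from recorded loss curves amounts to taking the minimum of the validation loss. The following Python sketch shows that step. The loss values are hypothetical numbers chosen to have the shapes described in the text (decreasing subtrain loss, U-shaped validation loss); they are not the values behind Figure 36.3.

```python
def select_epochs(validation_loss):
    """Return the 1-based epoch with minimal validation loss."""
    best_index = min(range(len(validation_loss)), key=validation_loss.__getitem__)
    return best_index + 1

def diagnose(epoch, best_epoch):
    """Label an epoch count relative to the selected one."""
    if epoch < best_epoch:
        return "underfit"
    if epoch > best_epoch:
        return "overfit"
    return "selected"

# Hypothetical loss curves, one value per epoch.
subtrain_loss = [1.00, 0.80, 0.55, 0.40, 0.30, 0.22, 0.15, 0.10]
validation_loss = [1.05, 0.88, 0.66, 0.52, 0.47, 0.50, 0.58, 0.70]

best = select_epochs(validation_loss)
print(best)  # 5
print([diagnose(e, best) for e in (1, best, len(validation_loss))])
# ['underfit', 'selected', 'overfit']
```

Note that the subtrain loss keeps falling after epoch 5, so it cannot be used for this selection.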

Overall, in this section we have seen how a 
neural network for regression can be trained using 
gradient descent (for learning parameter vectors, 
given fixed hyper-parameters) and sub-train/ vali- 
dation splits (for learning hyper-parameter values 
to avoid under/overfitting). 


COMPARING NEURAL NETWORKS FOR IMAGE 
CLASSIFICATION 


In this section we provide a comparison of several 
other neural networks for image classification. In 
general, in a neural network with L − 1 hidden
layers we can represent the prediction function as
the composition of L intermediate functions $f_l$,
for all layers $l \in \{1, \dots, L\}$:


$$f(x) = f_L[\cdots f_1[x]] \tag{36.5}$$


Each of the intermediate functions has the same
form:


$$f_l(t) = A_l(W_l t), \tag{36.6}$$


where $A_l$ is an activation function and
$W_l \in \mathbb{R}^{u_l \times u_{l-1}}$
is a weight matrix with elements that must be
learned based on the data. This model includes
several hyper-parameters which must be fixed prior
to learning the neural network weights:


• The number of layers L.
• The activation functions $A_l$.
• The number of units per layer $u_l$.
• The sparsity pattern in the weight matrices $W_l$.


The number of units in the input layer is fixed,
$u_0 = D$, based on the dimension of the inputs
$x \in \mathbb{R}^D$. The number of units in the output
layer $u_L$ is also fixed based on the
outputs/labels y. The numbers of units in the
hidden layers $(u_1, \dots, u_{L-1})$ are
hyper-parameters which control under/overfitting.
Increasing the numbers of hidden units $u_l$
results in larger weight matrices $W_l$, which
in general means more parameters to learn, and
larger capacity for fitting complex patterns in
the data. The sparsity pattern of $W_l$ specifies
which entries are forced to be zero; this technique
is used in “convolutional” neural networks for
avoiding overfitting and reducing
training/prediction time. When the matrix is not
sparse (no entries forced to zero), we refer to the
layer as dense or fully connected.

For example, in the previous section we used
a neural network for regression with one hidden
layer, which in this more general notation means
using L = 2 intermediate functions; the input
dimension is $u_0 = D = 1$, the number of hidden
units is $u_1 = U = 50$, and there is a single output
$u_2 = 1$ to predict. The weight matrices are dense/
fully connected (no convolution/sparsity), of
dimension $W_1 \in \mathbb{R}^{50 \times 1}$,
$W_2 \in \mathbb{R}^{1 \times 50}$. The hidden
layer activation function $A_1$ used by the R nnet
package is the logistic sigmoid,
$\sigma(t) = 1/(1 + \exp(-t))$, and the output
activation for regression (real-valued outputs) is
the identity, $A_2(t) = t$.
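The composition of intermediate functions can be sketched as a loop over layers. The Python sketch below follows the L = 2 regression example, but with 3 hidden units instead of 50 so that the weights fit on one line; the weight values and function names are hypothetical, and bias terms are omitted to match the form f_l(t) = A_l(W_l t).

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def identity(t):
    return t

def layer(weights, activation, inputs):
    """f_l(t) = A_l(W_l t), with A_l applied to each element of the product."""
    return [
        activation(sum(w_ij * t_j for w_ij, t_j in zip(row, inputs)))
        for row in weights
    ]

def predict(x, weight_matrices, activations):
    """Compose the intermediate functions: f(x) = f_L[... f_1[x]]."""
    hidden = x
    for W, A in zip(weight_matrices, activations):
        hidden = layer(W, A, hidden)
    return hidden

# Hypothetical weights: W1 is 3 x 1 (hidden layer), W2 is 1 x 3 (output layer).
W1 = [[0.5], [-1.0], [2.0]]
W2 = [[1.0, 1.0, 1.0]]
out = predict([0.0], [W1, W2], [sigmoid, identity])
print(out)  # [1.5]
```

Each row of a weight matrix produces one unit of the next layer, which is why W_l has u_l rows and u_{l-1} columns. An activation such as softmax acts on the whole vector rather than element by element, so it would need a slightly different `layer` function.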

In this section we implement three other neural
networks for image classification. Using the
“zip.train” data set of N = 7291 handwritten
digits (Hastie and Tibshirani, 2009), each input is
a greyscale image of 16 × 16 pixels, which means
that the number of input units is $u_0 = 256$. As in
Figure 36.1 (top) there are ten output classes, one
for each digit. For the activation function $A_L$ in
the output layer we use the “softmax” function which
results in a score/probability for each of the ten
possible output classes, so the number of output
units is $u_L = 10$.
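The softmax output activation converts ten real-valued scores into ten probabilities that sum to one. A minimal Python sketch follows; the scores are hypothetical, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Map real-valued scores to probabilities that sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer scores for the ten digit classes 0..9.
scores = [2.0, 1.0, 0.1] + [0.0] * 7
probs = softmax(scores)
predicted_digit = probs.index(max(probs))
print(predicted_digit, sum(probs))
```

The predicted class is the one with the largest probability, which is also the one with the largest score.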

The three neural networks that we consider are: 


linear L = 1 intermediate function with 2,570 
parameters to learn (linear model, inputs 
fully connected to outputs, no hidden 
units/layers). 


dense L = 9 intermediate functions with 97,410 
parameters to learn (nonlinear model, each 
hidden layer dense/fully connected with 
100 units).


sparse L = 3 intermediate functions with
99,310 parameters to learn (nonlinear
model, one convolutional/sparse layer
followed by two dense/fully connected layers;
labeled “conv” in Figure 36.4).
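The parameter counts quoted for the linear and dense networks can be reproduced with a short calculation. The Python sketch below assumes that each layer has one bias term per output unit in addition to its weight matrix. That is an assumption (the formula for f_l above shows no bias), but it is the one under which the counts match. The sparse/convolutional count is not reproduced here because it depends on architecture details not given in the text.

```python
def count_parameters(units):
    """units = [u_0, u_1, ..., u_L]; each layer has u_{l-1} * u_l weights plus u_l biases."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(units, units[1:]))

linear = count_parameters([256, 10])                # inputs fully connected to outputs
dense = count_parameters([256] + [100] * 8 + [10])  # eight hidden layers of 100 units
print(linear, dense)  # 2570 97410
```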


We defined and trained each neural network 
using the keras R package (Allaire and Chollet, 
2020). We used the fit function with argument
validation_split=0.2, which creates a
single split (80% subtrain, 20% validation). We
selected the number of epochs hyper-parameter by 
minimizing the validation loss, and we used the 
selected number of epochs to re-train the neural 
network on the entire train set (no subtrain/vali- 
dation split). 

We did this entire procedure K = 4 times, once 
for each fold/split in K-fold cross-validation. Note 
that even though these data have a pre-defined split 
into “zip.train” and “zip.test” files, we used K-fold 
cross-validation on the “zip.train” file, yielding K 
train/test splits that we used to estimate mean and 
variance of prediction accuracy for these models 
(the “zip.test” file was ignored). In each split we 
used the test set to quantify the prediction accuracy 
of the learned models. It is clear that the test
accuracy of all three neural networks is significantly
larger than the baseline model which always pre- 
dicts the most frequent class in the train set (Figure 
36.4, left); they are clearly learning some non- 
trivial predictive relationship between inputs and 
outputs. Furthermore, it is clear from Figure 36.4 
(right) that the dense neural network is slightly 
more accurate than the linear model (p = 0.032 in 
paired one-sided $t_3$-test), and the
sparse/convolutional neural network is significantly more accurate
than the dense model (p = 0.009). 
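A paired one-sided t-test over K = 4 folds has 3 degrees of freedom, for which the t distribution has a closed-form cumulative distribution function. The Python sketch below uses that closed form to compute a p-value. The per-fold accuracies are hypothetical numbers chosen for illustration, not the values behind Figure 36.4, and the function names are assumptions.

```python
import math
from statistics import mean, stdev

def t3_cdf(t):
    """CDF of Student's t distribution with 3 degrees of freedom (closed form)."""
    s = t / math.sqrt(3.0)
    return 0.5 + (s / (1.0 + s * s) + math.atan(s)) / math.pi

def paired_one_sided_t_test(acc_a, acc_b):
    """Test whether mean(acc_a - acc_b) > 0, using exactly four paired folds."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    if len(diffs) != 4:
        raise ValueError("this sketch only handles K = 4 folds (3 degrees of freedom)")
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    p_value = 1.0 - t3_cdf(t_stat)
    return t_stat, p_value

# Hypothetical per-fold test accuracies (percent), paired by fold ID.
acc_dense = [97.0, 98.1, 96.2, 98.4]
acc_linear = [96.1, 97.0, 95.6, 96.9]
t_stat, p_value = paired_one_sided_t_test(acc_dense, acc_linear)
print(t_stat, p_value)
```

The test is paired because both algorithms are evaluated on the same K train/test splits, so fold-to-fold difficulty cancels in the differences.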

In summary, from this comparison it is clear 
that among these three neural networks, the sparse 
model should be preferred for most accurate
predictions in this particular “zip” data set.
However, we must be careful not to generalize these
conclusions to other data sets; even for some other
image classification data sets such as MNIST (Figure
36.1), the most accurate algorithm may be
different. For very difficult data sets, it may even be the
case that these three neural networks are no more 
accurate than the baseline model which always 
predicts the most frequent class in the train set. 
In general, we always need to use computational 
cross-validation experiments to determine which 
machine learning algorithm is most accurate in 
any given data set. To learn a predictive model with 
maximum prediction accuracy, machine learning 
algorithms other than neural networks should be 
additionally considered (e.g., regularized linear 
models, decision trees, random forests, boosting, 
support vector machines). 


CROSS-VALIDATION FOR EVALUATING 
PREDICTIONS OF EARTH SYSTEM MODEL 
PARAMETERS 


As a final example application, we consider using 
cross-validation to evaluate a neural network that 
predicts carbon cycle model parameters (Tao et al., 
2020). In this context there is a data set with 
N = 26,158 observations, each one a soil sample 
with D = 60 input features. There are 25 real-valued 
output variables to predict; each is the value of an 
earth system model parameter at the location of 
the soil sample. We want a neural network that 
will be able to predict the values of these earth sys- 
tem parameters at new locations. Tao et al. (2020) 
proposed using a neural network with L = 4 fully 
connected layers and dropout regularization for 
this task (see paper for details). In this section 
the “multi-task” model uses the same number of 
layers/units as described in that paper; the term 
multi-task means that the neural network outputs
a prediction for all 25 outputs/tasks. For
comparison, we additionally consider “single-task”
models with the same number of hidden layers/units,
but only one output unit. We expect the multi-task
model to sometimes be more accurate, because of
the expected correlation between outputs (earth
system model parameters). To see whether or not
these neural networks learn any nontrivial
predictive relationship between inputs and output, we
consider a baseline model which always predicts
the mean of the train set label/output values (and
does not use the inputs at all).

Figure 36.4. Prediction accuracy of functions learned for image classification of handwritten digits. The baseline function always predicts the most frequent class in the train set; the other three learned functions are neural networks with different numbers of hidden layers (linear=0, conv=2, dense=8).

Here we show how K = 5 fold cross-validation 
can be used to evaluate how well these neural 
networks predict each of the outputs at new loca- 
tions. We first assign a fold ID from 1 to 5 to each 
observation/row, either systematically using the 
longitude coordinate, or randomly (Figure 36.5, 
top). We can define a cross-validation procedure 
using both sets of fold IDs, in order to answer the 
question, “is it more difficult to predict at new 
longitudes, or new random locations?” We expect 
that predicting at new longitudes should be more 
difficult, because that involves more extrapolation 
(predicting outside the range of observed data val- 
ues). In detail, for each fold ID from 1 to 5, we 
define the test set as the data points which have 
been assigned that fold ID using both methods
(longitude and random). For these data with
N = 26,158 observations total, each fold has
approximately 5000 observations, so each resulting
test set has approximately 1000 observations. As
described in the last section on image classifica- 
tion, we used the R keras package to compute 
the neural network parameters and predictions 
(using a maximum of 100 epochs, and a single 
80% subtrain 20% validation split to choose the 
optimal number of epochs for re-training on the 
entire train set). For each fold/model/output we 
computed mean squared error with respect to 
the test set, and we plot these values for four of 
the 25 outputs (Figure 36.5, bottom). It is clear 
that some outputs are more difficult to predict 
than others; for cryo and maxpsi outputs the neu- 
ral networks show little or no improvement over 
baselines, whereas for tau4s3 and fs2s3 outputs 
we observed substantial improvements over base- 
lines. As expected, there is a difference in test error 
between fold assignment methods (random has 
lower error rates than longitude for several outputs),
indicating that it is indeed easier to predict at 
new random locations, and harder to predict at 
new longitudes. Finally, the multi-task models are 
slightly more accurate than the single-task models, 
indicating that the neural network is learning to 
exploit the correlations between outputs. Overall,
this comparison has shown how cross-validation
can be used to quantitatively evaluate and compare
machine learning algorithms for predicting earth
system model parameters.

Figure 36.5. Cross-validation for estimating error rates of machine learning algorithms that predict earth system model parameters. Top: fold IDs were assigned to each observation using longitude (left) or randomly (right). Bottom: prediction error for four of the 25 outputs. Please see (Tao et al., 2020) for meanings of abbreviations (cryo, maxpsi, tau4s3, fs2s3).
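The two fold assignment methods used in this comparison, and the shared test set, can be sketched in Python as follows. Several details are assumptions made for illustration: longitude folds are formed as equal-count bands by rank, the train set for a given method is taken to be all observations outside fold k under that method, and the longitudes are simulated rather than taken from the soil data.

```python
import random

def longitude_folds(longitudes, n_folds):
    """Assign fold IDs 1..n_folds by rank of longitude (equal-count contiguous bands)."""
    order = sorted(range(len(longitudes)), key=lambda i: longitudes[i])
    fold_ids = [0] * len(longitudes)
    for rank, i in enumerate(order):
        fold_ids[i] = rank * n_folds // len(longitudes) + 1
    return fold_ids

def random_folds(n_obs, n_folds, seed=0):
    """Assign fold IDs 1..n_folds at random, with near-equal fold sizes."""
    fold_ids = [(i % n_folds) + 1 for i in range(n_obs)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids

def shared_test_set(k, lon_ids, rand_ids):
    """Indices assigned fold k by both methods, used as the common test set."""
    return [i for i, (a, b) in enumerate(zip(lon_ids, rand_ids)) if a == k and b == k]

def train_set(k, fold_ids):
    """Indices outside fold k under one assignment method (assumed train set)."""
    return [i for i, fid in enumerate(fold_ids) if fid != k]

# Hypothetical longitudes for 1000 simulated sites.
rng = random.Random(0)
longitudes = [rng.uniform(-125.0, -67.0) for _ in range(1000)]
lon_ids = longitude_folds(longitudes, n_folds=5)
rand_ids = random_folds(len(longitudes), n_folds=5)

for k in range(1, 6):
    test = shared_test_set(k, lon_ids, rand_ids)
    print(k, len(test), len(train_set(k, lon_ids)), len(train_set(k, rand_ids)))
```

Using one shared test set per fold means that any difference in error between the two methods comes from the train sets alone: the longitude train set excludes the whole band around the test points, whereas the random train set includes neighbors at nearby longitudes.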

In comparison to the neural network practice 
in unit 10, the main difference is that here we 
discussed how held-out test sets can be used to 
estimate prediction accuracy/error rates of learn- 
ing algorithms. Chapter 38 discusses how a valida- 
tion set can be used to avoid overfitting, as we have 
done in this chapter as well. We have additionally 
discussed how K = 5 fold cross-validation can be 
used to generate several train/test splits, which can
be used to estimate prediction error rates for each 
fold/data/algorithm combination (e.g., Figure 
36.5, bottom). This technique is useful since it 
allows us to see which algorithms are significantly 
more/less accurate than others on given data sets. 


SUGGESTED READING 


Machine learning is a large field of research with many 
algorithms, and there are several useful textbooks that 
provide overviews from various perspectives:


QUIZZES


1. When using a design matrix to represent 
machine learning inputs, what does each row 
and column represent? What other data/options 
does a supervised learning algorithm such as 
gradient descent need as input, and what does it 
yield as output? 


2. When splitting data into train/test sets, what is 
the purpose of each set? When splitting a train 
set into subtrain/validation sets, what is the pur- 
pose of each set? What is the advantage of using 
K-fold cross-validation, relative to a single split? 


3. In order to determine if any non-trivial predic- 
tive relationship between inputs and output has 
been learned, a comparison with a baseline that 
ignores the inputs must be used. How do you 
compute the baseline predictions, for regression 
and classification problems? 


4. How can you tell if machine learning model 
predictions are underfitting or overfitting? 


5. When using the nnet function in R to learn a 
neural network with a single hidden layer, do 
large or small values of the number of iterations 
hyper-parameter result in overfitting? Why? 


6. When using the nnet function in R to learn a
neural network with a single hidden layer, and you
do not yet know how many iterations to use, 
what data set should you use as input to nnet? 
How should you learn the number of iterations 
to avoid underfitting and overfitting? After hav- 
ing computed the number of iterations to use, 
what data set should you then use as input to 
nnet to learn your final model? Hint: possible 
choices for set to use are all, train, test, subtrain, 
validation.


This chapter describes a PROcess-guided deep
learning and DAta-driven modeling (PRODA) 
approach to optimize parameterization of Earth 
system models (ESMs) using spatio-temporal 
datasets. PRODA involves both data assimilation 
to estimate parameter values and deep learning 
to predict spatial and temporal distributions of 
parameter values so as to optimize ESM prediction. 
An application to the Community Land Model ver- 
sion 5 (CLM5) using soil organic carbon (SOC) 
distributions in the conterminous United States 
illustrates the potential and utility of the PRODA 
approach. 


THE NEED FOR OPTIMIZING 
PARAMETERIZATION OF EARTH SYSTEM 
MODELS 


Earth system models (ESMs) are used to simulate 
historical and potential future states of climate and 
ecosystems. However, simulations often deviate
substantially from observations. For example, soil
carbon dynamics simulated by ESMs vary widely 
among models and often fit poorly with observa- 
tions. Modeled global soil carbon storage differs 
by up to six-fold among 11 models of the Coupled 
Model Intercomparison Project Phase 5 (CMIP5) 
ensemble (Todd-Brown et al. 2013). None of the 
models reproduces the spatial distribution of SOC 
stocks presented in the Harmonized World Soil 
Database (HWSD) (Luo et al. 2015). 

Uncertainty in simulating SOC dynamics with 
ESMs could stem from poor parameterization, 
incorrect model structure, or biased external forc- 
ing (Luo and Schuur 2020, chapter 33). While 
model structure represents ecological processes 
(e.g., decomposition of soil organic matter), 
parameters in ESMs characterize properties of the 
processes, such as baseline decomposition rate at 
reference temperature and moisture content, or 
sensitivity to these drivers. The choice of
parameter values can strongly influence model projections
of SOC dynamics. Parameter values in the current
generation of ESMs, however, are mostly deter- 
mined on an ad hoc basis. They may be derived from 
the results of field experiments, other models, or 
informed from scientific or grey literature (Luo 
et al. 2001), but rarely take into account the range 
of possible values encompassed by such sources. 

Data assimilation techniques to estimate param- 
eter values from observations were discussed 
and illustrated in earlier chapters (units 6, 7, 8). 
Parameter values constrained by data assimilation 
can improve SOC simulation in ESMs compared 
to the default parameter values. For instance, the 
global representation of SOC distribution in the 
Community Land Model version 3.5 (CLM3.5) 
was improved from explaining 27 to 41% of varia- 
tion in the HWSD database by constraining model 
parameters with a Bayesian Markov Chain Monte 
Carlo (MCMC) data assimilation method (Hararuk 
et al. 2014). The large share of variation in
observed SOC that ESMs still leave unexplained is
partly due to the textbook convention that parameter
values of a simulation model must be constant, in
contrast to variables, which can vary over the time
course of a simulation (Forrester 1961). In reality,
ecosystem properties,
which parameters characterize in models, con- 
stantly evolve via acclimation and adaptation. In 
addition, a model, no matter how complex it is, 
can never represent all the processes of a system at 
resolved scales (Luo and Schuur 2020). Interactions 
of processes at unresolved scales with those at 
resolved scales should be reflected in model param- 
eters. Therefore, Luo and Schuur (2020) argue that 
parameter values in ESMs may have to vary over 
space and time (i.e., heterogeneous parameter val- 
ues) to represent changing properties of evolving 
ecosystems and unresolved processes. 

The advent of big ecological data provides a 
golden opportunity to reconcile model representa- 
tions with observations and quantify the spatial and 
temporal features of key parameters in soil carbon 
cycle simulation. Meanwhile, new techniques such 
as deep learning have been proposed to improve 
performance of ESMs (Reichstein et al. 2019). By 
constructing computational models with multiple 
processing layers and allowing the models to learn 
representations of data from multiple levels of 
abstraction (LeCun et al. 2015), deep learning tech- 
niques have promising applications in Earth sys- 
tem science, such as pattern classification, anomaly 
detection, regression, and space- or time-dependent 
state prediction (Reichstein et al. 2019). Further
exploration is warranted of how best to employ deep
learning techniques to reduce uncertainties in
simulated carbon dynamics in ESMs.

Here, we propose the PROcess-guided deep 
learning and DAta-driven modelling (PRODA) 
approach to estimate spatially and temporally 
heterogeneous parameter values for ESMs from 
extensive spatio-temporal datasets (“big data”) at 
regional or global scales. The PRODA approach 
estimates parameter values at individual sites via 
data assimilation and builds a deep learning model 
to upscale the site-level estimates of parameters 
to predict spatially heterogeneous parameters at 
regional and global scales so that modeled and 
observed SOC are maximally matched. 

In this chapter, we introduce the PRODA 
approach by using an extensive dataset of vertical 
soil profiles across the conterminous United States 
to optimize SOC representation by CLM5. We dis- 
cuss the PRODA-optimized model performance in 
representing SOC stock and its vertical and spatial 
distributions, and compare it with results of the 
default model simulation and after the data assimi- 
lation optimization. In particular, we highlight that 
the PRODA approach helps the process model to 
achieve the most precise SOC distribution ever rep- 
resented in ESMs. An accurate SOC representation 
in ESMs is critical to fully understand soil carbon 
feedbacks to future climate change. 


THE WORKFLOW OF PRODA 


Three fundamental components together formu- 
late the PRODA approach (Figure 37.1a), namely 
the process-based model, the site-level data assim- 
ilation, and the deep learning model. Process- 
based models with their predefined structure and 
default parameter values simulate SOC distribu- 
tions using meteorological forcing data. Data 
assimilation is used to estimate parameter values 
of a process-based model with soil carbon data at 
sites where the observations were made. The deep 
learning model is used to predict optimized site-
level parameter values from their associated envi-
ronmental variables. Eventually, the process-based
model will apply the optimized parameter values 
upscaled by the deep learning model to simulate 
SOC distributions at regional or global scales. 

Figure 37.1. Workflow of the PRODA approach. (a) PRODA optimally matches CLM5 as the process-based model (b) with
vertical SOC profiles across the conterminous United States (c). We first assimilate data at each site into CLM5 to estimate its param-
eters through the Markov Chain Monte Carlo (MCMC) method. We then assemble the estimated site-level parameter values
(i.e., the mean value of the posterior distribution after MCMC) as targets to be predicted by a multilayer neural network with
environmental covariates in a deep learning model. The parameters predicted by the deep learning model are applied to CLM5
to optimize model representation of SOC distribution.

Process-based model: We use the matrix
representation of the Community Land Model
version 5 (CLM5) to facilitate data assimilation
and model simulation in the PRODA approach
(Figure 37.1b). CLM5 is the latest version of CLM
models (Lawrence et al. 2019). Its soil carbon
module is similar to that in CLM4.5 (Koven et al. 
2013), except that it has an option to change the 
number of soil layers from a default of 20. In this 
example, we use ten soil layers with a vertical trans- 
formation among carbon pools from the surface 
to a maximum depth of 3.8 m as in CLM4.5. The 
soil carbon component of CLM5 includes carbon 
transfer among four litter pools (coarse woody 
debris, metabolic litter, cellulose litter, and lignin 
litter) and three soil organic carbon pools (fast, 
slow, and passive SOC) in each of ten layers, total- 
ing 70 pools. The thickness of soil layers increases 
exponentially from the surface layer (1.75 cm) to 
deep layers (151 cm), with a total depth of 3.8 m 
over the ten layers. Vertical carbon transfer between 
soil layers only occurs among the adjacent layers 
and represents both diffusive and advective carbon 
flux transportation caused by bioturbation and 
cryoturbation. The baseline advective rate of car- 
bon flux is set to zero in CLM5 as a default, and 
this is assumed in our example as well. 


We have discussed in units 1-5 that carbon bal- 
ance equations in land carbon models can be uni- 
fied to a matrix form. For CLM5, we use the matrix 
equation to describe carbon transfer among the 70 
pools with state variables X(t) as:

$$
\frac{dX(t)}{dt} = B\,u(t) + A\,\xi(t)\,K\,X(t) - V(t)\,X(t) \tag{37.1}
$$

where B is a vector (70 × 1) of partitioning coeffi-
cients from C input to each of the pools (unitless),
and u(t) is the C input rate (gC m⁻³ day⁻¹). A repre-
sents the transfer coefficients among litter and soil
pools (unitless), including the transfer coefficients
from four litter pools to three soil carbon pools
as well as the transfer coefficients of SOC among
soil carbon pools in the same layer. $\xi(t)$ represents
effects of environmental variables on decomposi-
tion of litter and soil (unitless). It includes sca-
lars of temperature ($\xi_T$), soil water ($\xi_W$), oxygen
($\xi_O$), nitrogen ($\xi_N$), and depth ($\xi_D$). K indicates
the decomposition rate of SOC in different litter
and soil carbon pools (day⁻¹). V(t) represents SOC
mixing among vertical soil layers through cryo-
turbation or bioturbation (day⁻¹). The t in paren-
theses indicates that the corresponding element
is time-dependent. At a steady state of the carbon
cycle ($dX(t)/dt = 0$), the SOC content of each carbon
pool at each layer can be calculated as:

$$
X(t) = \left[A\,\xi(t)\,K - V(t)\right]^{-1}\left(-B\,u(t)\right) \tag{37.2}
$$
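
As a minimal illustration of Equation 37.2, the sketch below solves the steady state for a hypothetical four-pool system (one litter pool and one soil pool in each of two layers) rather than the 70 pools of CLM5. The pool layout, all numerical values, and the helper names (`solve`, `system_matrix`, `steady_state`, `tendency`) are assumptions made for illustration only. The small Gaussian elimination routine is there just to keep the example free of dependencies; a linear algebra library would normally be used.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def system_matrix(A, xi, k, V):
    """Build A*xi*K - V; xi and k hold the diagonals of the diagonal matrices."""
    n = len(k)
    return [[A[i][j] * xi[j] * k[j] - V[i][j] for j in range(n)] for i in range(n)]


def steady_state(A, xi, k, V, B, u):
    """Equation 37.2: X = [A xi K - V]^-1 (-B u)."""
    return solve(system_matrix(A, xi, k, V), [-b * u for b in B])


def tendency(X, A, xi, k, V, B, u):
    """Equation 37.1: dX/dt = B u + A xi K X - V X."""
    M = system_matrix(A, xi, k, V)
    n = len(X)
    return [B[i] * u + sum(M[i][j] * X[j] for j in range(n)) for i in range(n)]


# Hypothetical pool order: [litter L1, soil L1, litter L2, soil L2]
B = [0.7, 0.0, 0.3, 0.0]          # partitioning of C input (unitless)
u = 2.0                           # C input rate (illustrative value)
A = [[-1.0, 0.0, 0.0, 0.0],       # -1 on the diagonal: loss from each pool
     [0.4, -1.0, 0.0, 0.0],       # 40% of decomposed litter enters soil, same layer
     [0.0, 0.0, -1.0, 0.0],
     [0.0, 0.0, 0.4, -1.0]]
xi = [0.8, 0.8, 0.5, 0.5]         # environmental scalars per pool
k = [0.01, 0.0005, 0.01, 0.0005]  # baseline decomposition rates (per day)
d = 1e-4                          # mixing rate between the two layers (per day)
V = [[d, 0.0, -d, 0.0],
     [0.0, d, 0.0, -d],
     [-d, 0.0, d, 0.0],
     [0.0, -d, 0.0, d]]

X_ss = steady_state(A, xi, k, V, B, u)
```

Substituting `X_ss` back into `tendency` should return values that are zero to within rounding error, which is a convenient check that the signs in Equations 37.1 and 37.2 are used consistently.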


Soil carbon data and site-level data assimila- 
tion: We use vertical SOC profiles in the contermi- 
nous U.S. from the World Soil Information Service 
(WoSIS) dataset (www.isric.org) for the site-level 
data assimilation (Figure 37.1c). The depth of 
recorded SOC layers ranges from the surface to 
more than 3 metres. A total of 26,509 soil profiles 
with a total of 240,148 layers at different depths 
in the conterminous U.S. are available in this study. 

In addition, we use the mean values of global 
net primary productivity (NPP) from 2000 to 
2014 as carbon input (DAAC 2018). After running 
the CLM4.5 model to a steady state under pre-
industrial climate forcing (version code of forcing
database: I1850CRUCLM45BGC), ten-year records
of soil temperature and soil water potential of the 
conterminous U.S. were obtained from the model 
outputs. 

The site-level data assimilation constrains
parameter values of CLM5 with the vertical SOC
profile observed at each site, using the Markov
chain Monte Carlo (MCMC) method (as described
in chapter 22). Three parallel chains are generated,
each containing a test run of 20,000 iterations
and a formal run of 30,000 iterations. To effec-
tively capture the vertical distribution pattern of
SOC content with depth, we assign weights
to observations at different depths in calculating
the discrepancy between modeled and observed
SOC content (i.e., the cost function). These weights
decrease exponentially with the depth (i.e., $\mathrm{weight}_i = e^{-|i|}$,
where i refers to the layer's soil depth
in observations) except for the top layer and the 
bottom layer, where a weight of ten is assigned to 
accelerate calibrating the upper and lower bounds 
of the SOC distribution curve. To monitor the effi- 
ciency of the MCMC process, an acceptance rate 
threshold is set. For Markov chains whose accep- 
tance rate is higher than 50% or lower than 15%, 


the corresponding data assimilation results are 
rejected. After the MCMC process, the first half of 
the accepted parameter values in the formal run 
are discarded as burn-in. The Gelman-Rubin sta- 
tistics of each parameter are then calculated for 
each soil profile to ensure the convergence of 
these three independent MCMC results. We ran- 
domly select one chain after eliminating the burn- 
in period to generate the posterior distributions 
of parameters. The mean value of the parameter’s 
posterior distribution is calculated and chosen to 
serve as the training target in the deep learning 
model. 
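
The screening steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the depth unit (metres) and the exact form of the exponential weight are assumptions, and the synthetic chains are drawn from normal distributions purely to show how the Gelman-Rubin statistic behaves for chains that agree and chains that do not.

```python
import math
import random
import statistics


def depth_weights(depths_m):
    """Weights exp(-depth); the top and bottom observed layers are set to 10."""
    weights = [math.exp(-abs(z)) for z in depths_m]
    weights[0] = 10.0
    weights[-1] = 10.0
    return weights


def acceptance_ok(rate, low=0.15, high=0.50):
    """Keep a chain only if its acceptance rate lies between 15% and 50%."""
    return low <= rate <= high


def discard_burn_in(chain):
    """Drop the first half of the accepted values as burn-in."""
    return chain[len(chain) // 2:]


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance([statistics.fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)
# Three synthetic chains sampling the same distribution (converged case).
mixed = [discard_burn_in([rng.gauss(0.0, 1.0) for _ in range(4000)])
         for _ in range(3)]
# Three synthetic chains stuck around different values (non-converged case).
stuck = [discard_burn_in([rng.gauss(mu, 1.0) for _ in range(4000)])
         for mu in (0.0, 5.0, 10.0)]

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

A Gelman-Rubin value close to 1 indicates that the chains are sampling the same distribution, whereas a value well above 1, as for the `stuck` chains, signals that they have not converged.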

We evaluate the effectiveness of the site-level 
data assimilation by the coefficient of efficiency: 


$$
E = 1 - \frac{\sum_i \left(\mathrm{obs}_i - \mathrm{mod}_i\right)^2}{\sum_i \left(\mathrm{obs}_i - \overline{\mathrm{obs}}\right)^2} \tag{37.3}
$$

where $\mathrm{obs}_i$ and $\mathrm{mod}_i$ are the observed and modeled
SOC content at the ith soil layer of one soil profile, and $\overline{\mathrm{obs}}$
is the mean value of observed SOC content of the
soil profile. In this study, we take profiles having
negative E values as invalid and discard the results
from the corresponding deep learning model.
Moreover, at those sites where an observation is 
available at only one soil depth, we do not apply 
the data assimilation to the data point. After those 
data sets are excluded, 25,444 out of 26,905 soil 
profiles, or 94.6% of the entire dataset, are used in 
the PRODA approach. 
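
The coefficient of efficiency in Equation 37.3 can be computed directly from a pair of profiles. The sketch below is only an illustration: the function name and the SOC values are hypothetical, chosen to show the three cases of a perfect fit, a fit no better than the profile mean, and a fit that would be discarded.

```python
def coefficient_of_efficiency(obs, mod):
    """E = 1 - sum((obs - mod)^2) / sum((obs - mean(obs))^2) for one profile."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - m) ** 2 for o, m in zip(obs, mod))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed SOC content at four depths of one profile.
obs = [30.0, 20.0, 10.0, 5.0]

e_perfect = coefficient_of_efficiency(obs, obs)                       # model equals data
e_mean = coefficient_of_efficiency(obs, [16.25] * 4)                  # model equals the mean
e_poor = coefficient_of_efficiency(obs, [5.0, 10.0, 20.0, 30.0])      # reversed profile
```

Here `e_perfect` is 1, `e_mean` is 0, and `e_poor` is negative, so only the last profile would be treated as invalid under the criterion described above.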

Deep learning model: We design a deep learn-
ing model with multiple processing layers to predict
optimized parameter values from environmental
covariates. A total of 60 environmental variables
that describe the climatic, edaphic, and vegetation
features at the observational sites are used. We use
80% of the total dataset to train and validate the
neural network. After model training, we use the
remaining 20% of the dataset to quantify the pre-
diction accuracy of the deep learning model. The
predicted parameter values are first compared with
those retrieved in site-level data assimilation and
then applied to the matrix CLM5 model to simu-
late soil organic carbon stock at each observational
site. Meanwhile, we use the trained deep learning
model to generate parameter maps across the United
States based on gridded environmental covariates.
The parameter maps are then applied to the matrix
CLM5 to simulate the SOC distributions across the
United States at a resolution of 0.5 degrees.
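
The 80/20 partition of sites described above can be sketched as a seeded random split. The function name, the seed, and the use of integer site identifiers are assumptions for illustration; the chapter does not state how the split was drawn or how the 80% portion was further divided between training and validation.

```python
import random


def split_sites(site_ids, test_fraction=0.2, seed=0):
    """Shuffle site identifiers and hold out a fraction of them for testing."""
    ids = list(site_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# 25,444 profiles are retained after the site-level data assimilation.
train_val_ids, test_ids = split_sites(range(25444))
```

Fixing the seed makes the held-out set reproducible, so that the reported test accuracy always refers to the same sites.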

SOC distributions optimized by data assimi-
lation: To analyse the significance of the spatially
explicit parameter estimation of PRODA compared
with a traditional approach, we perform a batch
data assimilation that uses the whole observational
dataset as one batch in the MCMC method to estimate
parameter values of CLM5.
The estimated parameter values from this method
are spatially homogeneous, in contrast with the
site-level data assimilation, which is an intermediate step
of the PRODA approach to estimate spatially hetero-
geneous parameter values. SOC distributions simu-
lated by CLM5 after optimization by the batch data assimilation
can then be compared with the results from PRODA.

The batch data assimilation runs three parallel
MCMC chains, each containing 50,000 iterations
as a test run and 200,000 iterations as a formal run.
Weights at different depths in calculating the cost
function and acceptance control are the same as
those in the site-level data assimilation. After the
MCMC run, we first discard the first half of
the accepted parameter values of the formal run
as burn-in. The Gelman-Rubin statistics for each
parameter are then calculated to ensure the con-
vergence of these three independent MCMC
results. We randomly select one Markov chain
after eliminating the burn-in period to generate
the posterior distribution for each parameter. We
then randomly sample parameter values from the
posterior distributions 1,000 times and apply the
sampled parameter values to the CLM5 matrix
model. We estimate SOC content distributions at
different sites by calculating the average of the
results. The same sampled parameter values are
further assigned in CLM5 to estimate SOC content
distributions at each grid cell on the map of the
conterminous U.S. at a resolution of 0.5 degrees.
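
The posterior sampling step can be illustrated with a toy one-pool model in place of the CLM5 matrix model. Everything in the sketch is hypothetical: the model `toy_soc`, the two-value "posterior", and the function names are assumptions chosen only to show that the reported SOC is the average of model runs over sampled parameter values, not a single run at the mean parameter value.

```python
import random
import statistics


def toy_soc(k, u=1.0, xi=0.8):
    """Steady-state SOC of a single pool: input divided by effective turnover."""
    return u / (xi * k)


def ensemble_mean_soc(posterior_k, n_draws=1000, seed=0):
    """Average the model output over parameter values drawn from the posterior."""
    rng = random.Random(seed)
    draws = [rng.choice(posterior_k) for _ in range(n_draws)]
    return statistics.fmean([toy_soc(k) for k in draws])


# Hypothetical posterior sample of a decomposition rate (per day).
posterior_k = [0.001, 0.002]
soc_ensemble = ensemble_mean_soc(posterior_k)
soc_at_mean_k = toy_soc(statistics.fmean(posterior_k))
```

Because the toy model is nonlinear in `k`, `soc_ensemble` differs from `soc_at_mean_k`, which is one reason for propagating the sampled values through the model before averaging.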

Reference SOC data products: We use two 
sets of SOC data, WISE30sec and SoilGrids250m 
(Hengl et al. 2017), as references to compare with 
spatial and vertical distributions of SOC obtained 
from our study over the United States. WISE30sec
is an updated version of the dataset HWSD, gen-
erated by using traditional mapping methods at a
resolution of 30 × 30 arc seconds. SoilGrids250m is a
global gridded soil information dataset generated 
by using machine learning techniques at 250 m 
resolution. We took data of SOC content over three 
depth intervals from these two datasets, 0-30 cm, 
0-100 cm and 0-200 cm. All the original data 
with high resolution were resampled to a resolu- 
tion of 0.5 x 0.5 degrees. 


MODEL REPRESENTATION OF SOC CONTENT 
ACROSS OBSERVATION SITES 


The original CLM5 model with default param-
eterization presents significant geographical biases
in the estimation of SOC content in comparison
with observations. Modeled SOC in the grid cell in
which each observation site is located is com-
pared with the observations (Figure 37.2a). SOC stor-
age is systematically overestimated by the original
model near the east and west coasts of the U.S. but
underestimated in the Midwest. The consistency
between observed and modeled SOC content is
low, with R² = 0.32 and RMSE = 15.9 kg C m⁻³
(Figure 37.2b and Table 37.1).

The batch data assimilation method generates
the distribution of SOC from continentally homo-
geneous posterior distributions of parameters esti-
mated from all the observation data at once.
With the batch data assimilation, the
mismatch between observed and modeled SOC
content in the CLM5 model is moderately reduced
in the northern and eastern parts of the U.S. (Figure
37.2c). However, geographical biases in model
representation of SOC are not eliminated. CLM5
optimized by the batch data assimilation still
underestimates SOC storage in the Intermontane
Plateaus and southern Great Plains. Meanwhile,
overestimation still exists in the Great Lakes area
and the Northeast. Overall, CLM5 after optimiza-
tion by the batch data assimilation explains 43% of the
variation in the observed SOC content, with RMSE
= 11.4 kg C m⁻³ (Figure 37.2d and Table 37.1).

Through the deep learning model, the PRODA
approach predicts the optimized parameter val-
ues at each site across the conterminous U.S. from
its environmental variables. PRODA-optimized
CLM5 achieves a better representation of SOC
distribution than the batch data assimi-
lation. Little systematic geographical bias in
estimating SOC storage is observed across the
study domain (Figure 37.2e). The modeled and
observed SOC content are highly correlated, with
R² = 0.62 and RMSE = 9.0 kg C m⁻³ (Figure 37.2f
and Table 37.1).
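
The two metrics reported in this section can be computed as in the sketch below. The function names and the example values are hypothetical. The R² of a simple linear regression between observed and modeled values equals the squared Pearson correlation, which is what `r_squared` returns.

```python
import math


def rmse(obs, mod):
    """Root mean square error between observed and modeled values."""
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(obs, mod)) / len(obs))


def r_squared(obs, mod):
    """Coefficient of determination of a linear regression of mod on obs."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_m = sum(mod) / n
    s_om = sum((o - mean_o) * (m - mean_m) for o, m in zip(obs, mod))
    s_oo = sum((o - mean_o) ** 2 for o in obs)
    s_mm = sum((m - mean_m) ** 2 for m in mod)
    return s_om ** 2 / (s_oo * s_mm)


# Hypothetical case: the model doubles every observed value.
obs = [1.0, 2.0, 3.0, 4.0]
mod = [2.0, 4.0, 6.0, 8.0]
r2_example = r_squared(obs, mod)
rmse_example = rmse(obs, mod)
```

In this example R² is 1 even though every modeled value is twice the observed one, while the RMSE is clearly nonzero. The two metrics therefore answer different questions, which is why both are reported in Table 37.1.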


Figure 37.2. The agreement between observed and modeled SOC content with different approaches. SOC estimates modeled
by CLM5 were extrapolated to the depths of observations to evaluate model performance. The upper panels indicate the devia-
tion of the modeled SOC storage from the observation of the whole profile for each site. The lower panels show the results of
linear regression between observed and modeled vertical SOC content at different depths for the different methods. In calculating
the deviation of modeled SOC storage from observations, for better presentation, the positive (overestimation) and negative
(underestimation) discrepancies between the observed and modeled SOC content were scaled based on the 95% quantile of the
positive discrepancy and the 5% quantile of the negative discrepancy, respectively. Only the results of the testing set
are presented for the PRODA approach.


TABLE 37.1
Performance of CLM5 in representing SOC distribution
under different approaches

                           Model performance
Method                     R²      RMSE (kg C m⁻³)
Default CLM5               0.32    15.86
Batch Data Assimilation    0.43    11.41
PRODA Approach             0.62    8.95

Note: R² is the coefficient of determination from linear regression
between the observed and modeled SOC content. RMSE is the root
mean square error.


SPATIAL DISTRIBUTION OF SOC ACROSS THE
CONTERMINOUS U.S.


We take point observations (Figure 37.3a-c) and
estimations from WISE30sec (Figure 37.3d-f) and
SoilGrids250m (Figure 37.3g-i) as references
to compare the SOC estimations by CLM5 with
default parameterization, optimized parameter-
ization after the batch data assimilation, and the
PRODA approach. At the continental scale, the ref-
erence data suggest large volumes of SOC in the
northeast and northwest of the conterminous U.S.
The magnitude of SOC content in these regions
can be as high as 30 kg C m⁻² for the 0-200 cm
depth interval. Meanwhile, a decreasing gradient
of SOC from the northeast to the southwest is
observed. High SOC exists in areas across the Great
Plains, extending from Texas to the Great Lakes.

The default CLM5 model (Figure 37.3j-l) cap- 
tures the continental SOC content gradient from 
the northeast to the southwest but fails to repro- 
duce sub-regional features of SOC distribution in 
the Great Plains. Meanwhile, SOC content in the 
east and northwest estimated by the original CLM5 
is significantly higher than that indicated by the 
reference data. After optimization by the batch 
data assimilation, CLM5 reproduces the continen-
tal SOC gradient from the northeast to the south-
west with reasonable values (Figure 37.3m-o).
However, high SOC content in the Great Plains is
still not well represented. The PRODA approach
performs best overall, helping achieve the most
realistic spatial SOC distribution (Figure 37.3p-r)
in comparison with observations (Figure 37.3a-c)
and data products (Figure 37.3d-i). In addition 
to capturing the continental SOC distribution pat- 
tern, the PRODA-optimized CLM5 presents more 
accurate subregional SOC distribution patterns in 
the Great Plains.

Figure 37.3. Modeled spatial SOC distributions in three depth intervals across the conterminous U.S. by different approaches
and datasets.


VERTICAL DISTRIBUTION OF SOC ACROSS THE
CONTERMINOUS U.S.


We take results from WISE30sec and SoilGrids250m
as references for estimated SOC stocks at different
depth intervals (Figure 37.4). For the top 2 m of
soil, WISE30sec suggests 243 PgC and SoilGrids250m
estimates 269 PgC stored as SOC. By depth
interval, WISE30sec suggests 98 PgC at 0-30 cm depth,
81 PgC at 30-100 cm, and 64 PgC at 100-200 cm.
SoilGrids250m estimates 102, 86, and 81 PgC at the
same three depth intervals, respectively.

The original CLM5 model with default param-
eterization substantially overestimates SOC stocks
in comparison with the references at all three soil 
depths (Figure 37.4). Compared with the refer- 
ences, the overestimation becomes stronger with 
increasing soil depth. Both the batch data assimila- 
tion and the PRODA approach help CLM5 estimate 
more reasonable SOC storage compared with the 
original CLM5 model. For the top 2 m of soil, we estimate
165 PgC using the batch data assimilation and 246 PgC
using the PRODA approach.

For different vegetation types, the PRODA
approach presents more accurate estimations of
the vertical SOC distribution than the batch data
assimilation (Figure 37.5). CLM5 underestimates
the SOC content in the evergreen forest, shrubland,
savanna, grassland, and wetland regions after the
optimization by the batch data assimilation. The
PRODA-optimized CLM5, in contrast, presents the
least biased estimations in comparison with obser-
vations at all depth intervals in the aforementioned
regions.

Figure 37.4. SOC storage across the conterminous U.S. at different depths estimated by different approaches and data sources.

Figure 37.5. SOC storage for different vegetation types across the conterminous U.S. at different depths estimated by different
approaches and data sources. The number in parentheses after each vegetation type is the number of sites with that vegetation in the
dataset used in this study. The error bars indicate ±0.5 standard deviation.


TOWARD MORE REALISTIC REPRESENTATIONS 
OF SOC DISTRIBUTION 


This chapter has systematically explored the signif- 
icance of spatially heterogeneous parameterization 
for the adequate prediction of SOC distribution in 
Earth system models, with CLM5 as a representative 
case. The results support the PROcess-guided deep 
learning and DAta-driven modelling (PRODA) as 
a promising approach to optimize model repre- 
sentation of SOC, utilising the explanatory power 
implicit in immense observational data. PRODA 
considers biogeochemical processes in the soil 
carbon cycle while preserving strong big data 
analysis ability to integrate soil data into complex 
models. We compared the PRODA-optimized SOC
representation by CLM5 with the default model
simulation and the results optimized by batch
data assimilation and conclude that PRODA helped
CLM5 achieve the most accurate SOC representa-
tion. Indeed, to our knowledge, no better fit to reference
data on SOC has been simulated by process-based models.


In the past decades, different approaches have 
been developed for representation of SOC distri- 
bution (Figure 37.6). Soil scientists collect soil 
data and develop mechanistic understanding of 
soil carbon cycling from field observations or 
experiments. The simulation modeling approach 
conceptualizes those mechanisms into mathemati- 
cal equations and strives to simulate SOC accord- 
ing to process understanding. Notwithstanding 
the detailed description of carbon cycle processes, 
the models struggle to realistically simulate SOC 
distribution. Such unrealistic model simulations 
mainly arise from inadequate parameteriza- 
tion. Parameters that represent critical processes 
of the soil carbon cycle in the real world are not 
sufficiently constrained with widely distributed
observational data. Therefore, it is difficult for 
process-based models to accurately represent SOC 
distributions. In our example, CLM5 with default 
parameter values substantially overestimates the 
total SOC storage of the conterminous U.S. and 
presents strong geographical biases in the repre- 
sentation of SOC distribution. 

Batch data assimilation provides a way of incor-
porating observational data information into the
process model to improve SOC simulation. Such
data-driven optimization harmonizes site-level
data information as a whole to adjust the param-
eter values for better representation of the SOC. We
have shown in the example study that the opti-
mized CLM5 with data assimilation successfully
corrects the considerable overestimation of total
carbon storage across our study domain.

Figure 37.6. Schema of different approaches to represent SOC distributions. The PRODA approach combines process
understanding (as featured by simulation modelling) with the real-world information brought out by big data analysis from
machine learning. The latter serves primarily to obtain accurate representations of the spatial distribution of SOC and its underlying
mechanisms.

Batch data assimilation, however, fails to
capture the spatial variability of observed SOC.
The spatially invariant parameter values optimized 
from the batch data assimilation approach are 
insufficient in describing the heterogeneity of SOC 
distribution at large scales. In our example study, 
geographical bias still exists after the optimization 
by the batch data assimilation. 

The PRODA approach addresses the issue of geo-
graphical bias by first estimating parameters at
the site level using data assimilation and then
using a deep learning model to upscale the
site-level estimates of parameters to the whole
conterminous U.S. The spatially varying parameter values
retrieved from the PRODA approach contribute to a
more accurate model representation of SOC across
the range of ecosystem types (vegetation class, soil
type, geology, etc.) across the continent. PRODA-
optimized CLM5 simulates the most realistic SOC
distribution ever simulated by process models. The
high agreement between observed and modeled
SOC content (R² = 0.623 across the conterminous
U.S.) achieved by the PRODA approach is com-
parable with that for harmonization mapping in
SoilGrids250m by machine learning (R² = 0.635
across the globe) (Hengl et al. 2017), and greater
than the agreement between separate gridded
empirical data products (Wu et al. 2019).

More importantly, the PRODA approach paves 
the way for more mechanistic understanding of 
the soil carbon cycle from big data analysis with
machine learning. Machine learning alone is good
at accurately describing SOC distribution, yet pre- 
vious applications used in digital soil mapping 
focus only on the complex statistical relationship 
between environmental variables and SOC. The 
PRODA approach not only precisely maps SOC dis- 
tributions but also provides the spatial patterns of 
different mechanisms (as represented by different 
parameters) of the soil carbon cycle. In the future, 
disentangling how these mechanisms vary with 
environments and quantifying their importance 
to SOC storage will be essential for understanding 
terrestrial carbon dynamics and their feedbacks to 
climate change. 


SUGGESTED READING 


Tao, F., Z. Zhou, Y. Huang, Q. Li, X. Lu, S. Ma, X. Huang, 
Y. Liang, G. Hugelius, L. Jiang, R. Doughty, Z. Ren, 
and Y. Luo. 2020. Deep Learning Optimizes Data- 
Driven Representation of Soil Organic Carbon in 
Earth System Model Over the Conterminous United 
States. Frontiers in Big Data 3, 17. 


QUIZZES 


1. What is the main difference, in terms of parame-
terization scheme, between the batch data assim-
ilation and the PRODA approach as described in
this chapter?


2. Describe the input and output of the deep learn-
ing model in the PRODA approach.


3. What is the advantage of the PRODA approach in
comparison to conventional machine learning 
methods in representing SOC distributions?


This practice offers guidance on how to use the
PROcess-guided deep learning and DAta-driven
modeling (PRODA) approach to integrate obser-
vations with the biogeochemical module of the
Community Land Model version 5 (CLM5) to best
represent regional soil organic carbon distributions.
Over three exercises, we focus on how to build,
train, and tune a deep learning model in the PRODA
approach to predict parameters estimated from
site-level data assimilation. Readers can use either
the CarboTrain platform or the original codes
(via https://www2.nau.edu/luo-lab/?workshop)
to explore different deep learning options and to
understand and modify the optimization methods.


RATIONALE OF ESTIMATING PARAMETER 
VALUES BY A DEEP LEARNING MODEL 


In Chapter 37, we discussed the performance of dif-
ferent approaches that assimilate observations into
CLM5 to best represent soil organic carbon (SOC)
stocks across the conterminous United States (Tao et
al. 2020). We concluded that the PRODA approach
outperforms data assimilation alone by accounting
for the spatial heterogeneity of parameters. In
the PRODA approach, parameter values are first
retrieved at the sites where observations were made, through
the Markov Chain Monte Carlo (MCMC) method,
and then upscaled to the region by a deep learning
model. Eventually, we apply the parameter values
predicted by the trained deep learning model to
CLM5 to simulate SOC stock and distributions (see
Chapter 37 for details of the PRODA workflow).

We use various environmental variables to pre- 
dict the parameter values by a neural network in 
the deep learning model. The rationale behind this 
procedure is that parameters in the process-based 
model can be expected to vary with environmental 
conditions over space and time in order to account 
for changing properties of evolving ecosystems 
and unresolved processes (Luo and Schuur 2020). 
The relationships between parameter values and 
environmental variables, however, are often not 
easily identified so as to be explicitly represented 
in model structure. Therefore, we introduce a deep 
learning model to explore such complex relation- 
ships. Specifically, we set the local environmental 
variables as the input and the estimated parameter 
values in the site-level data assimilation as the pre- 
diction target in a deep learning model. Then, the 
deep learning model is trained to best predict the 
parameter values based on input environmental 
variables (Figure 38.1).

Figure 38.1. Schematic diagram of using a deep learning model to predict parameter values. The deep learning model is used
to interpret relationships between environmental variables and parameter values retrieved from the site-level data assimilation.

Inputs: environmental variables; these can be anything
you think relevant to parameter values (soil carbon
cycle processes).
Neurone: basic unit of a neural network; represents the
predicted value for the next layer.
Neurone weight: the weight given to the neurone value.
Activation function: a nonlinear function applied
to the weighted values to generate nonlinearity.
Outputs: predicted parameter values.
Training target: the real parameter value used to
calibrate the predicted ones. We took the results of
site-by-site data assimilation as the real value.
Loss function: quantifies the difference between
predicted and real parameter values. The smaller
the better.
Optimiser: algorithm to find the lowest loss by
adjusting the weights of neurones.

Figure 38.2. Basic elements in a typical multilayer neural network.


WHAT IS A NEURAL NETWORK?


Chapter 36 introduced the basic concepts of
machine learning/deep learning. In this practice, we
focus on how to build and optimize a deep learn-
ing model that is structured as a multilayer neural
network. Four elements together form the skel-
eton of a typical multilayer neural network:
inputs, outputs, neurones, and neurone
weights (Figure 38.2). In this practice, the inputs
are environmental variables that are relevant to the
parameters in CLM5, such as climatic, edaphic, and
vegetation variables. The outputs are the parameter
values we predict using the trained neural network,
which will be further applied in CLM5 to simulate
SOC distributions. The neurones are the basic units
of a neural network. They are distributed across the hidden
layers of the neural network (Figure 38.2), receive
information from the inputs or the previous layer,
and generate predictions for the next
layer or as the final outputs. Finally, the neurone
weights are the weights assigned to the predictions
of each neurone. We get the final neural network
predictions by combining the predictions of the dif-
ferent neurones according to their weights.

In addition to the four basic elements of a neural
network, we apply an activation function to the
weighted sum of the values each neurone receives
in order to generate that neurone's output. We generally use
nonlinear activation functions to enable the neural
network to capture nonlinear relationships between the
inputs and the final outputs.
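The roles of inputs, neurones, weights, and the activation function can be illustrated with a minimal forward pass in plain Python. This is only a sketch: the layer sizes, weight values, biases, and function names below are made up for illustration and are not taken from the CarboTrain code.

```python
def relu(z):
    """Rectified Linear Unit: a common nonlinear activation function."""
    return max(0.0, z)


def identity(z):
    """Linear (no-op) activation, often used for the output layer."""
    return z


def dense_layer(inputs, weights, biases, activation):
    """Compute the outputs of one layer of neurones.

    Each neurone forms a weighted sum of its inputs plus a bias,
    then passes that sum through the activation function.
    """
    outputs = []
    for neurone_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neurone_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical network: 3 environmental inputs -> 2 hidden neurones -> 1 output.
inputs = [0.5, -1.0, 2.0]
hidden_weights = [[0.2, 0.4, -0.1], [-0.3, 0.1, 0.5]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -2.0]]
output_biases = [0.05]

hidden = dense_layer(inputs, hidden_weights, hidden_biases, relu)
prediction = dense_layer(hidden, output_weights, output_biases, identity)
print(hidden, prediction)
```

Here the first hidden neurone receives a negative weighted sum, so ReLU switches it off, which is the kind of nonlinearity a purely linear combination of inputs could not express.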

We train the neural network to best predict our 
training target (i.e., the parameter values retrieved 
from the site-level data assimilation) via a certain 
optimization algorithm. A loss function is used
to quantify the difference between the predicted
parameter values and the “real” parameter values
retrieved from the site-level data assimilation.


The lower the loss function value is, the closer the
predictions are to the “real” parameter values,
and the better the model performance is.

The loss function value alone, however, cannot 
initiate the optimization. We need an algorithm 
to determine how to adjust the neural network 
according to the results of the loss function so 
that the final predictions can best fit the training 
target. In the neural network, we have the opti- 
mizers to adjust the weights of neurones accord- 
ing to the results of the loss function. Ideally, after 
sufficient training, the optimizer will eventually 
lead the neural network to reach a point where 
the loss function reaches the lowest value it can
attain. We regard the neural network at this point
as reaching its global optimum. Predictions by the 
neural network will consequently be the closest to 
the training targets. 
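The interplay between the loss function and the optimizer can be sketched with plain gradient descent on a one-weight model. This is a deliberately simplified stand-in for the optimizers used in the exercises; the data, the learning rate, and the function names are assumptions chosen for illustration.

```python
def mean_squared_error(predictions, targets):
    """Loss function: average squared difference between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def train(xs, targets, learning_rate=0.1, epochs=200):
    """Fit prediction = w * x by gradient descent on the MSE loss."""
    w = 0.0
    history = []
    for _ in range(epochs):
        predictions = [w * x for x in xs]
        history.append(mean_squared_error(predictions, targets))
        # Gradient of the MSE with respect to the weight w.
        grad = sum(2 * (p - t) * x
                   for p, t, x in zip(predictions, targets, xs)) / len(xs)
        # The "optimizer" step: move the weight against the gradient.
        w -= learning_rate * grad
    return w, history


xs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]  # made-up targets generated with w = 2
w, history = train(xs, targets)
print(w, history[0], history[-1])
```

The loss alone only reports how far the predictions are from the targets; it is the update rule inside the loop that turns that information into a change of the weight.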


HYPERPARAMETERS IN THE NEURAL NETWORK 


Hyperparameters are the parameters whose values 
control the training process in the neural network. 
Hyperparameters that control the shape of a neural 
network include the number of hidden layers and 
the neuron numbers of each hidden layer. Hidden 
layer numbers determine the depth of a neural 
network. Neuron numbers of each hidden layer, at 
the same time, control the width of the neural net- 
work. Choices of hidden layer numbers and neuron 
numbers are largely empirical. You can try neural 
networks with different shapes and choose the one 
that can best predict your training target. 

The epoch number determines how many times the deep learning algorithm will
go through the whole dataset for training. During each epoch, the neural
network proposes a set of neuron weights and the optimizer adjusts them
according to the loss function results. The number of epochs ranges from
hundreds to thousands in different deep learning applications. You may try
different numbers to find the epoch number that allows the loss function
value to be minimized, so that the neural network can accurately predict
the training target.

In addition, the batch size defines how many training samples are used as
one batch when working through the whole training dataset in each epoch.
For example, for a training set with 10,000 samples, if we set the batch
size as 50, then it will take 200 iterations (iteration number = sample
size / batch size) to go through the whole training set in each epoch of
optimization. Possible choices of batch size vary from 1 to the size of the
whole training set. Setting batch size as 16, 32, or 64 would be a plausible
start.

We also need to decide several hyperparameters that control the
optimization process in the neural network before initiating the training.
You may need to specify the loss function, activation function, and the
optimizer. We use the Keras package in Python to build and train the neural
network in this practice. Keras provides multiple choices for these
hyperparameters. The loss function can be the mean squared error (expressed
as “mean_squared_error” in Keras) or other functions that can quantify the
difference between predicted and target values. The activation function can
be the Rectified Linear Unit (expressed as “relu” in Keras), the hyperbolic
tangent function (expressed as “tanh” in Keras), the sigmoid function
(expressed as “sigmoid” in Keras), other activation functions that are
pre-defined by Keras, or a function that you define yourself. Keras provides
several options for the optimizer. Choosing “Adam”, “Adadelta”, or “RMSprop”
would generally be a good start. You can refer to the website
https://keras.io/api/ for more possibilities of the hyperparameters you can
use in Keras.


EXERCISE 1


Building and training a neural network that uses environmental variables to
predict parameter values in CLM5. Follow the instructions in CarboTrain:


a. Select Unit 10
b. Select Exercise 1
c. Select Output Folder
d. Open Source Code
e. Read section “Setting NN Structure” to get familiar with hyperparameters
   that we used in this neural network.
f. Run Exercise
g. Check results in your Output Folder. Four figures will be generated
   (Figure 38.3). The figure in loss_nn.png describes the changes of the
   loss function value in the training and validation sets with increasing
   epochs. para_nn.png describes how well the trained neural network
   predicts different parameters. nn_obs_vs_mod.jpeg indicates how well the
   CLM5 model can simulate SOC after applying the predicted parameter values
   from the trained neural network. map_para_us.jpeg describes the predicted
   parameter maps from the trained neural network.


Figure 38.3. Output figures in Exercise 1.

Note: You can also use your own computer to execute the exercise. You may
download the code package via https://www2.nau.edu/luo-lab/?workshop,
and follow the instructions:


a. Read and operate the Python code via
   “yourpath/nau_training_proda/nn_clm_cen.py”.
b. After the neural network training, operate the R code via
   “yourpath/nau_training_proda/NN_Indi_MAP_Project_CLM_CEN.R”.
c. Operate the R code via “yourpath/nau_training_proda/nn_para_map.R”.


QUESTIONS:


1. How many layers are set in the default neural network? How many
   neurones are distributed at each layer?
2. What is the activation function used in the default neural network?
3. How many epochs will the default neural network run in training? What
   is the batch size?
4. What is the loss function used in the default neural network? Can you
   justify the reason why we need a loss function?
5. In Figure 38.3a, which two hyperparameters in the neural network control
   the calculation of loss value and changes to the loss value?
6. In Figure 38.3b, which parameters are predicted well by the trained
   neural network, and which not? Why do you think the neural network
   predicts some parameters well, and others not that well?
7. According to what you have learnt in Chapter 37, what will the results
   presented in Figure 38.3d be used for?
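The hyperparameters discussed in this section (network shape, activation function, loss function, optimizer, epochs, and batch size) can be gathered in one place, as in the following Keras sketch. The specific values, the dictionary layout, and the function names are hypothetical and chosen only for illustration; the defaults in the exercise code may well differ.

```python
import math

# Hypothetical hyperparameter choices (not the defaults of the exercise code).
HYPERPARAMS = {
    "hidden_layers": [64, 32],      # neurones per hidden layer (depth = 2)
    "activation": "relu",
    "loss": "mean_squared_error",
    "optimizer": "adam",
    "epochs": 500,
    "batch_size": 32,
}


def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to pass through the training set once."""
    return math.ceil(n_samples / batch_size)


def build_model(n_inputs, n_outputs, hp=HYPERPARAMS):
    """Assemble and compile a multilayer network with Keras."""
    # Imported here so the settings above can be inspected
    # even on a machine where TensorFlow is not installed.
    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_inputs,)))
    for n_neurones in hp["hidden_layers"]:
        model.add(keras.layers.Dense(n_neurones, activation=hp["activation"]))
    model.add(keras.layers.Dense(n_outputs))  # linear output layer
    model.compile(optimizer=hp["optimizer"], loss=hp["loss"])
    return model


# The batch-size example from the text: 10,000 samples in batches of 50.
print(iterations_per_epoch(10000, 50))
```

Training would then be a call along the lines of `model.fit(x_train, y_train, epochs=hp["epochs"], batch_size=hp["batch_size"], validation_split=0.2)`, where `x_train` and `y_train` are placeholders for your own arrays of environmental variables and target parameter values.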


TUNING THE NEURAL NETWORK FOR BETTER 
PERFORMANCE 


Setting up basic neural network structures and 
hyperparameters does not guarantee satisfactory 
model performance in prediction. We need to tune 
the neural network for better performance. We 
will briefly introduce some common procedures 
in tuning the neural network (Figure 38.4). 

Suppose you find the predicted parameter values 
cannot fit well with the targets after training the 
neural network. In this case, you are recommended 
to first try a new model structure with more hidden 
layers (deeper neural network) and/or more neu- 
rons for each hidden layer (wider neural network). 
Expanding the depth and/or width of the neural 
network will generally increase its ability to inter- 
pret more complex relationships between the input 
environmental variables and the target parameter 
values. Meanwhile, you can also consider applying 
different activation functions or optimizers. 

We may sometimes encounter the overfitting
problem. Overfitting happens during training
of the neural network when the loss value of
the validation set stops decreasing along with the loss
value of the training set and begins to increase instead
(Figure 38.5). In such a case, even though the
final prediction of the training set may fit well 
with the target, the trained neural network can- 
not make precise predictions in the validation set 
(see a detailed discussion about the overfitting in 
Chapter 36). To avoid overfitting, we can adopt 
early stopping to stop training the neural network 
at reasonable epochs. After early stopping, the loss 
values of both the validation and training sets stay 
low and thus ensure the robustness of neural net- 
work predictions in both datasets. 
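The logic of early stopping can be sketched in a few lines of plain Python. The patience rule used here (stop once the validation loss has not improved for a set number of epochs) is one common convention and an assumption on our part; the loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch with the best validation loss,
    scanning only until the loss has failed to improve for
    `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training here
    return best_epoch


# Made-up validation losses: they fall, then rise as overfitting sets in.
val_losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.40, 0.47, 0.30]
print(early_stopping_epoch(val_losses))
```

Keras offers the same idea through its `EarlyStopping` callback, which monitors a quantity such as the validation loss and accepts a `patience` argument.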




Regularization is another option to pre- 
vent overfitting in training the neural network. 
Regularization applies penalties to the weights of 
neurones to avoid the predictions of the trained 
neural network relying too much on the perfor- 
mance of one or some small group of neurones. 
Regularization is a more advanced option in tun- 
ing the neural network. We may not touch on it 
too much in this book. If you are interested, you 
can refer to the website at https://machinelearn- 
ingmastery.com/how-to-reduce-overfitting-in- 
deep-learning-with-weight-regularization/. 
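As a small illustration of how a weight penalty works, the sketch below adds an L2 penalty (the sum of squared weights times a strength factor) to a data loss. The L2 form, the strength value, and the weight vectors are assumptions made for this example; other penalty forms exist.

```python
def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared weights."""
    return strength * sum(w ** 2 for w in weights)


def regularized_loss(data_loss, weights, strength=0.01):
    """Total loss = fit to the data + penalty on large weights."""
    return data_loss + l2_penalty(weights, strength)


# Two made-up weight vectors with the same data loss:
balanced = [0.5, -0.5, 0.5, -0.5]   # influence spread over all neurones
dominated = [2.0, 0.0, 0.0, 0.0]    # one neurone carries everything

print(regularized_loss(0.2, balanced))
print(regularized_loss(0.2, dominated))
```

With equal data loss, the configuration that concentrates its influence in a single large weight receives the higher total loss, so the optimizer is nudged toward the more balanced solution.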

Dropout offers a further option if we do not 
want the performance of a small group of neurones 
to matter too much in the final prediction of the 
trained neural network. The dropout option allows 
us to randomly switch off a specified fraction of
neurones at each training step. The neural
network will then be trained to not depend
too much on any specific neurones in prediction, 
thereby improving its robustness. If you check the 
default setting in Exercise 1, you will find we used 
the dropout option in the neural network training. 
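What dropout does to a layer's outputs during training can be sketched as follows. This uses the "inverted dropout" convention, in which surviving values are rescaled so their expected sum is unchanged; that convention, the dropout rate, and the names below are assumptions for illustration rather than a description of the exercise code.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10        # made-up outputs of ten neurones
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

A different random subset of neurones is silenced on every call, so no single neurone can be relied on throughout training. Dropout is applied only while training; at prediction time all neurones are used.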


Figure 38.5. Overfitting in neural network training and early
stopping option. (Axes: loss against epochs for the validation and
training sets, with the early stopping point marked.)

Figure 38.4. Tips on tuning the neural network for better model prediction performance. Reproduced from Lee (2016).


EXERCISE 2


Tuning the neural network used in Exercise 1.
Follow the instructions in CarboTrain:


a. Select Unit 10
b. Select Exercise 2
c. Select Output Folder
d. Change one or more hyperparameter values, e.g., select and change the
   Optimizer to “adam”.
e. Run Exercise
f. Check results in your Output Folder. Four figures as described in
   Exercise 1 will be generated.


Note: If you would like to use your own computer to do the exercise, you
may follow the instructions:


a. Open the Python code via “yourpath/nau_training_proda/nn_clm_cen.py”.
b. Change hyperparameter values at corresponding locations.
c. Operate “nn_clm_cen.py”.
d. After the neural network training, operate the R code via
   “yourpath/nau_training_proda/NN_Indi_MAP_Project_CLM_CEN.R”.
e. Operate the R code via “yourpath/nau_training_proda/nn_para_map.R”.


QUESTIONS:


How do the output figures change compared to those generated in
Exercise 1? Can you explain the reasons for these changes?


PRODA VERSUS DATA ASSIMILATION ALONE
FOR OPTIMIZED SOC DISTRIBUTIONS IN CLM5


In Exercises 1 and 2, we have introduced how to build, train, and tune a
neural network to best predict parameter values from environmental
variables. The deep learning model is the core part of the PRODA approach.
We optimize SOC representation in a process model by fully interpreting the
environmental dependencies of its parameters through a deep learning model.

Data assimilation alone can also utilize multi-site observations to
retrieve parameter values so that the biogeochemical model can better
represent SOC distributions. Unlike the PRODA approach, which optimizes
parameter values at each observational site, data assimilation alone
retrieves parameter values that are spatially invariant. Exercise 3 will
compare results of CLM5 in terms of the SOC representation when using PRODA
to set parameter values, compared with data assimilation alone.


EXERCISE 3


Comparing SOC representations in CLM5 between PRODA and data assimilation
alone. Follow the instructions in CarboTrain:


a. Select Unit 10
b. Select Exercise 3
c. Select Neural Network Task Folder
d. Select DA alone Task Folder
e. Select Output Folder
f. Run Exercise
g. Check results in your Output Folder. Three figures will be generated
   (Figure 38.6). Figure mod_vs_obs_nn.jpeg shows the agreement between the
   PRODA-optimized and observed SOC. Figure mod_vs_obs_ob.jpeg describes the
   agreement between retrieved and observed SOC using data assimilation
   alone. Figure soc_map.jpeg shows the SOC distributions simulated by CLM5
   at different depths across the conterminous United States after
   optimization by the PRODA approach, and following parameter estimation by
   data assimilation alone.


Figure 38.6. Output figures in Exercise 3. (a) Agreement between model
simulations and observations; (b) SOC distributions at different depths
across the conterminous United States simulated by CLM5 with parameter
estimation by data assimilation alone v. PRODA.


If you would like to use your own computer to do the exercise, you may
follow the instructions:


a. Open the R code via
   “yourpath/nau_training_proda/NN_Indi_MAP_Project_CLM_CEN.R”
b. Set variable “ifnn” as ifnn = 1 (PRODA results)
c. Operate “NN_Indi_MAP_Project_CLM_CEN.R”
d. Set variable “ifnn” as ifnn = 2 (data assimilation alone results)
e. Operate “NN_Indi_MAP_Project_CLM_CEN.R” again
f. Open the R code via
   “yourpath/nau_training_proda/Global_Projection_NN_CLM_CEN.R”.
g. Set variable “is_nn” as is_nn = 1 (PRODA results)
h. Operate “Global_Projection_NN_CLM_CEN.R”
i. Set variable “is_nn” as is_nn = 0 (data assimilation alone results)
j. Operate “Global_Projection_NN_CLM_CEN.R” again
k. Open and operate the R code via
   “yourpath/nau_training_proda/Different_Method_obs_vs_mod.R”
l. Open and operate the R code via
   “yourpath/nau_training_proda/different_method_soil_map.R”


QUESTION: 


Which approach performs better in representing 
SOC distribution? Why? 
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To summarize, this practice has explored the 
technical details of the PRODA approach in opti- 
mizing parameter values and retrieving SOC distri- 
butions. The deep learning model is the core part 
of the PRODA approach. The first two exercises 
illustrate the basic components in a neural network
(Exercise 1) and how to tune a configured
neural network for better performance (Exercise 
2). In Exercise 3, we applied CLM5 to simulate the 


SOC distribution across the conterminous United 
States after parameter estimation by data assimila- 
tion alone and further parameter optimization by 
the PRODA approach. We compared the agreement 
between simulations and observations after the 
two approaches. The results show the advantage of
using the PRODA approach to optimize SOC distribution
in biogeochemical models.


APPENDIX 1


Matrix Algebra in Land Carbon Cycle Modeling 


Ye Chen 


Northern Arizona University, Flagstaff, USA 


The purpose of this appendix is to deliver the nec- 
essary matrix algebra foundation you will need to 
understand the matrix model of the land carbon 
cycle, introduced in Chapter 1. If you have taken 
the undergraduate level of matrix algebra or above, 
you may skip this appendix. Please note that some 
important topics in the following are presented in 
a simple way and they could be easily extended 
into longer sections. Professor Gilbert Strang's 
Linear Algebra lecture in MIT OpenCourseWare 
(see Recommended Reading) is a good source for 
extra learning material and self-study. 


MOTIVATIONS 


Many processes can be described using a dynamical 
system. In one type of dynamical system, the state 
$X_{n+1}$ of a system at time n + 1 can be derived from
the state $X_n$ at time n using a transition matrix A:


$$X_{n+1} = AX_n$$


See the following problem for more details. 


Problem 1 


Consider a simplified carbon transfer model 
in which 3% of the carbon in the fast soil 
pool moves to the slow soil pool each year 
while 95% of the fast soil pool stays the 
same, and 1% of the slow soil pool moves to 
the fast soil pool while 97% of the slow soil 
pool stays the same with no other influences 
on the two pools. What are the pool sizes 
after 10 years? 50 years? 100 years? 

This question will be fully answered at 
the end of this appendix. To see how linear 


algebra can help us, we will do some basic 
analysis in this section. 

Note that there is 2% lost in each pool 
when transfer occurs every year. To simplify 
the problem, we are not going to track what 
happens with that carbon once it exits the 
two pools. 


Let $X_n = \begin{bmatrix} f_n \\ s_n \end{bmatrix}$ be the status of the two
pools at the nth year. That is, $f_n$ is the size of the
fast soil pool at the nth year, and $s_n$ is the size of
the slow soil pool at the nth year. Then


$$f_{n+1} = 0.95 f_n + 0.01 s_n$$


$$s_{n+1} = 0.03 f_n + 0.97 s_n$$


First of all, this linear system can be writ- 
ten down in the matrix form using the 
knowledge from sections 2 and 4 below, 


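$$\begin{bmatrix} f_{n+1} \\ s_{n+1} \end{bmatrix} = \begin{bmatrix} 0.95 & 0.01 \\ 0.03 & 0.97 \end{bmatrix} \begin{bmatrix} f_n \\ s_n \end{bmatrix}$$

or equivalently, $X_{n+1} = AX_n$, where A is
the 2 × 2 coefficient matrix.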
Given the pool size of the first year,
$X_0 = \begin{bmatrix} f_0 \\ s_0 \end{bmatrix}$, one interesting question is to find
out the pool size a long time after, i.e., $X_n$
for a large n. In this problem, n = 10, 50,
100. Note that


$$X_n = AX_{n-1} = A(AX_{n-2}) = \cdots = A^n X_0,$$

and finding $A^n X_0$ for large n can be computationally
intensive when the dimension
of A is large. However, if we can find the
eigenvalues of A as introduced in section 5,
then we may be able to turn the problem of
finding $A^n X_0$ with large n into finding $\lambda^n X_0$,
where $\lambda$ is an eigenvalue of A.
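Before turning to eigenvalues, the recurrence can simply be stepped forward year by year. The sketch below does this in plain Python using the coefficients of Problem 1; the initial pool sizes of 6 and 10 are an assumption borrowed from Example 4 later in this appendix, and the function names are illustrative.

```python
def step(fast, slow):
    """Advance the two pools by one year (coefficients from Problem 1)."""
    new_fast = 0.95 * fast + 0.01 * slow
    new_slow = 0.03 * fast + 0.97 * slow
    return new_fast, new_slow


def simulate(fast, slow, years):
    """Apply the yearly update repeatedly, i.e. compute A^n X_0."""
    for _ in range(years):
        fast, slow = step(fast, slow)
    return fast, slow


# Assumed initial pool sizes (taken from Example 4 for illustration).
for years in (10, 50, 100):
    fast, slow = simulate(6.0, 10.0, years)
    print(years, round(fast, 2), round(slow, 2))
```

For a 2 × 2 system this brute-force loop is cheap; the eigenvalue approach becomes attractive when the matrix is large or when n is very large.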


MATRIX OPERATIONS


Basic Operations 
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Definition 1 


A matrix is a rectangular array of numbers. We say
that a matrix A is an m × n matrix when it has m rows
and n columns, and the dimension of A is m × n.
For example, this is a 2 × 3 matrix:

$$A = \begin{bmatrix} 2 & 3 & 5 \\ 4 & 1 & -9 \end{bmatrix} \tag{A1.1}$$

Definition 2 


A matrix is square if it has the same number of rows 
and columns. 

We can use subscript notation to refer to
particular entries in a matrix: the notation
$A_{ij}$ refers to the entry of matrix A in row i
and column j.


Problem 2 


Let A be the matrix in Equation A1.1. What
is $A_{13}$?

Answer: The number in row 1 and column
3 is 5.


Definition 3 


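The transpose of a matrix A, denoted $A^T$,
is the matrix A with the rows and columns
switched. That is, $(A^T)_{ij} = A_{ji}$.

For example, for the matrix in Equation
A1.1


$$A^T = \begin{bmatrix} 2 & 4 \\ 3 & 1 \\ 5 & -9 \end{bmatrix}$$


We can add two matrices if they have the
same dimensions. We do this by adding corresponding
entries. For example,

$$\begin{bmatrix} 2 & 3 \\ 4 & -1 \end{bmatrix} + \begin{bmatrix} 5 & 0 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 2+5 & 3+0 \\ 4-6 & -1+3 \end{bmatrix} = \begin{bmatrix} 7 & 3 \\ -2 & 2 \end{bmatrix}$$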


Problem 3 


For two matrices A and B, is A + B always the 
same as B + A? 
Answer: Yes. 


Matrix Multiplication 


To multiply a scalar number k by a matrix, simply
multiply k with every element of the matrix. For
example,

$$2 \begin{bmatrix} 0 & 1 & 5 \end{bmatrix} = \begin{bmatrix} 0 & 2 & 10 \end{bmatrix}$$


To multiply two matrices, the number of columns 
of the first matrix must be the same as the number 
of rows of the second matrix. Let's say that we have 
two matrices, X, which is m X k, and Y, which is 
k x n. Then their product, denoted XY, will be an 
m × n matrix. Here is how to determine the elements
of the matrix product XY: to get $(XY)_{ij}$ (the
entry in the ith row and jth column), take the ith
row of X and the jth column of Y, multiply their 
corresponding elements, and add the results. 

Suppose we have the matrices


$$X = \begin{bmatrix} 1 & -2 & 3 \\ 0 & -1 & 6 \end{bmatrix}, \quad Y = \begin{bmatrix} 0 & -2 & 1 & 0 \\ 5 & 4 & 2 & 6 \\ 1 & -3 & 0 & -1 \end{bmatrix}$$


X has dimensions 2 × 3, and Y has dimensions
3 × 4. Since the number of columns of X equals the
number of rows of Y, we can multiply them, and
the result will be a 2 × 4 matrix.

Now, to determine the entry in the first row
and first column of the product XY, we look at the
first row of X, $(1, -2, 3)$, and the first column of Y, $(0, 5, 1)$.

We now take these two lists of three numbers,
multiply them element-by-element, and add the
results:


$$1 \cdot 0 + (-2) \cdot 5 + 3 \cdot 1 = -7$$


So far, we know that the matrix product XY
looks like this:


$$XY = \begin{bmatrix} -7 & ? & ? & ? \\ ? & ? & ? & ? \end{bmatrix}$$


Now let's compute $(XY)_{12}$. Since we are trying to compute the entry
of XY in the first row and second column, we take the first
row of X, $(1, -2, 3)$, and the second column of Y, $(-2, 4, -3)$.

Multiplying them pairwise and then adding
yields


$$1 \cdot (-2) + (-2) \cdot 4 + 3 \cdot (-3) = -19$$


Now XY looks like


$$XY = \begin{bmatrix} -7 & -19 & ? & ? \\ ? & ? & ? & ? \end{bmatrix}$$


Problem 4


Finish computing the matrix product XY.


Answer: $XY = \begin{bmatrix} -7 & -19 & -3 & -15 \\ 1 & -22 & -2 & -12 \end{bmatrix}$
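The row-times-column rule can be written directly as a short Python function and used to check the product by machine. This is a minimal sketch with matrices stored as lists of rows; the function name is our own, and the entries of X and Y are those of the worked example above.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    if len(X[0]) != len(Y):
        raise ValueError("columns of X must equal rows of Y")
    # Entry (i, j) is the ith row of X times the jth column of Y.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


X = [[1, -2, 3],
     [0, -1, 6]]
Y = [[0, -2, 1, 0],
     [5, 4, 2, 6],
     [1, -3, 0, -1]]

for row in matmul(X, Y):
    print(row)
```

Calling `matmul(Y, X)` raises an error because Y has four columns while X has only two rows, which is the dimension mismatch discussed in the next problem.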


Problem 5 


Use the matrices X,Y defined in the previous 
example, and also 


Compute each of the following if it is 
defined, or say undefined otherwise. 


1. YX   2. XA   3. AB   4. BA


Answer: 1. YX is undefined. 2. XA is undefined
as the dimension of X is 2 × 3, and the
dimension of A is 2 X 3: the inner dimen- 


10 14 
sion does not match. 3. AB= l | 4 
—1 7 
BA = . 
l 2 4 
Problem 6 


Based on the results of Problem 5, is AB 
always the same as BA? Explain why. 

Answer: No. What you found in 
Problem 5 is that matrix multiplication 
is not commutative. This is one way in 
which matrices are different from real 
numbers. With real numbers, you are able 
to switch around the order of operands 
being multiplied. But with matrices, you 
have to be very careful: you cannot do this 
with matrices! 

It turns out, however, that matrix multi- 
plication is associative: for any matrices X, 
Y, and Z, as long as they have dimensions 
that match up properly, it is always true 
that (XY)Z = X(YZ). That is, when doing 
more than one matrix multiplication, it 
doesn't matter which multiplication we 
do first, as long as we keep them in the 
right order. This means that we can write 
things like ABCDE instead of ((AB)C) (DE) 
or A(B(C(DE))) or ((AB)(CD))E since they 
are all the same. 


Quiz 1 


a. If the dimension of A is 10 × 20, B has only one column, and AB is well
   defined, how many columns does the matrix product AB have? (Answer: The
   matrix AB has one column.)

b. If the dimension of the matrix A is 10 × 25, the dimension of the matrix
   C is 10 × 20, and AB = C, what is the dimension of B? (Answer: The
   dimension of B is 25 × 20.)
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MATRIX EQUATIONS 


Identity Matrix, Inverse Matrix 


Problem 7 


Can you find a 2 × 2 matrix I which, when
multiplied by any other 2 × 2 matrix X,
yields X? That is, IX = XI = X for any 2 × 2
matrix X. I is called the 2 × 2 identity matrix
(sometimes also written $I_2$).


Answer: $I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$


Problem 8 


What is the 3 × 3 identity matrix, $I_3$? In general,
what does the n × n identity matrix $I_n$
look like?


Answer: $I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$. $I_n$ is an n × n
matrix with the main diagonal elements
being 1, and all other elements being 0.


Problem 9 


There is no such thing as a 2 x 3 identity 
matrix. Why not? 

Answer: To match the dimension such that 
IX = XI = X in the definition of identity 
matrix, I has to be a square matrix. 


Problem 10 


Multiply the following two matrices:


$$A = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & -3 \\ -3 & 2 \end{bmatrix}$$

What do you get? Why is this interesting?
Answer: AB = I.


Definition 4 


The inverse of a square matrix A, written
$A^{-1}$, is a matrix which multiplied by A results
in the identity matrix: $AA^{-1} = A^{-1}A = I$.


Problem 11 


As it turns out, not all square matrices 
have an inverse. But this should not be too 


surprising. Why not? (Hint: think about the 
inverse of 0). 


Solving Matrix Equations


Matrices which have an inverse are called invert- 
ible matrices, and matrices which do not have an 
inverse are called singular matrices. Why do we 
care whether a matrix is invertible? Well, remember 
what you do in algebra to solve an equation like 3x 
= 12: you multiply both sides by 1/3, the inverse of 
3. In the same way, inverting matrices allows us to 
solve matrix equations like AX = Y (where A, X, and Y are
all matrices). If A is invertible, we can left multiply
both sides of the equation by $A^{-1}$ to get $X = A^{-1}Y$.
Note that $YA^{-1}$ is not the solution for X, because matrix
multiplication is not commutative (Problem 6).


Problem 12


If A, B are invertible, solve this matrix equa- 
tion for X: 


AXB = Y 


Answer: $X = A^{-1}YB^{-1}$


Example 1


Solve this matrix equation for X:

$$Y = BC + ADKX$$

If the matrix product ADK is invertible,

$$Y = BC + ADKX$$


$$\Rightarrow ADKX = -BC + Y$$


$$\Rightarrow X = (ADK)^{-1}(-BC + Y)$$


$$\Rightarrow X = (ADK)^{-1}(-BC) + (ADK)^{-1}Y$$


Following the steps of this example, you can 


solve the next problem. 


Problem 13 


The matrix equation below is presented in
Chapter 1, Equation 1.6:


$$X'(t) = Bu(t) + A\xi(t)KX(t)$$

Can you solve it to obtain an expression
for X(t) if the matrix $A\xi(t)K$ is invertible,
$X'(t)$, B, A, $\xi(t)$, K and X(t) are matrices, and
u(t) is a scalar?

Answer: You may have noticed this equation
is similar to the matrix equation in
Example 1, except the notations are different.
The computation would still follow
the rule of matrix operations. And to solve
this equation for X, just follow the steps in
Example 1. The solution is


$$X(t) = (A\xi(t)K)^{-1}(-Bu(t)) + (A\xi(t)K)^{-1}X'(t)$$

Quiz 2 


Let A, C be invertible matrices. Solve AXC + BD = Y
for X. (Answer: $X = A^{-1}(Y - BD)C^{-1}$.)


LINEAR SYSTEM 


Definition 5 


A linear equation in the variables $x_1, x_2, \ldots, x_n$
is an equation that can be written in the
form:


$$a_1x_1 + a_2x_2 + \cdots + a_nx_n = b$$


where $a_1, a_2, \ldots, a_n$ and b are real numbers. Thus,
for example,


$$2x_1 + 3x_2 + x_3 - 11x_4 = 6$$


is a linear equation in the four variables
$x_1, x_2, x_3, x_4$. But the following equations are
not linear:


$$2x_1 + 2x_2 + x_3x_4 = 6$$


$$2x_1^2 + 3x_2 = 6$$


Definition 6 


A system of linear equations (or linear 
system) is a set of linear equations. 
The following is a simple linear system. 


Example 2 


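Solve the linear system:


$$(1) \quad -x_1 + \tfrac{1}{3}x_2 = 6$$


$$(2) \quad x_1 + x_2 = 10$$


This system can be easily solved by hand
using forward and backward substitution.
The solution is


$$x_1 = -2, \quad x_2 = 12.$$


Note that with the matrix multiplication,
the system is equivalent to


$$x_1 \begin{bmatrix} -1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 1/3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 10 \end{bmatrix}$$


Therefore, the solution can be written as


$$-2 \begin{bmatrix} -1 \\ 1 \end{bmatrix} + 12 \begin{bmatrix} 1/3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 10 \end{bmatrix}$$


Here is another approach using the matrix
inverse. The linear system can be written
down in the matrix form:


$$\begin{bmatrix} -1 & 1/3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 6 \\ 10 \end{bmatrix}$$

We denote this matrix form as


$$A\vec{x} = \vec{b}$$


To find the solution of this linear system
for $\vec{x}$, we left multiply this equation by $A^{-1}$
if it exists, then


$$\vec{x} = A^{-1}\vec{b}$$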
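The inverse-matrix route can be checked numerically for the 2 × 2 system of Example 2. The sketch below uses the standard closed-form inverse of a 2 × 2 matrix and exact fractions to avoid rounding; the helper names are our own, and for larger systems one would normally rely on a numerical library rather than an explicit inverse.

```python
from fractions import Fraction


def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix [[a, b], [c, d]] via its determinant."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (no inverse)")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def matvec(A, x):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


# Coefficient matrix and right-hand side of Example 2.
A = [[Fraction(-1), Fraction(1, 3)],
     [Fraction(1), Fraction(1)]]
b = [Fraction(6), Fraction(10)]

x = matvec(inverse_2x2(A), b)
print(x)
```

Multiplying A by the computed x should give back b, which is a convenient way to confirm any solution of a linear system.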


EIGENVALUES AND EIGENVECTORS 


Definition 7 


Let A be a square matrix.

A vector $\vec{x}$ is an eigenvector of A if
$\vec{x} \neq \vec{0}$ and there is a scalar $\lambda$ such that
$A\vec{x} = \lambda\vec{x}$.

A scalar $\lambda$ is an eigenvalue of A if there is
a vector $\vec{x} \neq \vec{0}$ such that $A\vec{x} = \lambda\vec{x}$.

Remark 1 By this definition, if $\lambda$ is an
eigenvalue of A and $\vec{x}$ is a corresponding
eigenvector, then we must have the
following for any positive integer n:


$$A^n\vec{x} = \lambda^n\vec{x}$$


Next, we are going to apply Definition 7
to verify the eigenvalues and eigenvectors of
a matrix.


Example 3

Confirm that both $\vec{v}_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$ and
$\vec{v}_2 = \begin{bmatrix} 1/3 \\ 1 \end{bmatrix}$ are eigenvectors for the matrix
$A = \begin{bmatrix} 0.95 & 0.01 \\ 0.03 & 0.97 \end{bmatrix}$ and find the
corresponding eigenvalues.

Solution: As


$$A\vec{v}_1 = \begin{bmatrix} 0.95 & 0.01 \\ 0.03 & 0.97 \end{bmatrix} \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.94 \\ 0.94 \end{bmatrix} = 0.94\,\vec{v}_1,$$


by the definition of eigenvalue, $\vec{v}_1$ is an
eigenvector for A with corresponding eigenvalue
$\lambda_1 = 0.94$.
Similarly, it is easy to verify that $\vec{v}_2$ is an
eigenvector for A with corresponding eigenvalue
$\lambda_2 = 0.98$.
In this example, the eigenvectors are 
already given, so it is relatively easy to deter- 
mine the eigenvalues. In general, to deter- 
mine the eigenvalues and eigenvectors by 
hand could be very hard or even impossible 
when the dimension of A is large. Luckily, 
modern programming environments like 
MATLAB and Python provide functions for 
this purpose. 
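For a 2 × 2 matrix the eigenvalues can still be found in closed form, as the roots of the characteristic polynomial $\lambda^2 - \mathrm{tr}(A)\,\lambda + \det(A) = 0$. The sketch below applies this to the matrix of Example 3; the formula is a standard result that this appendix does not derive, and the function name is our own.

```python
import math


def eigenvalues_2x2(A):
    """Real eigenvalues of a 2 x 2 matrix from its characteristic
    polynomial: lambda^2 - trace * lambda + det = 0."""
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    disc = trace ** 2 - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace - root) / 2, (trace + root) / 2


A = [[0.95, 0.01],
     [0.03, 0.97]]
lam1, lam2 = eigenvalues_2x2(A)
print(round(lam1, 6), round(lam2, 6))
```

For larger matrices a library routine such as `numpy.linalg.eig` is the practical choice, since no comparable closed form is available in general.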


Example 4 


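Let us continue finding the solution for
Problem 1. Let $\vec{x}_0 = \begin{bmatrix} 6 \\ 10 \end{bmatrix}$ and compute the
pool sizes after 50 years.


Solution: By Example 2, we have


$$\begin{bmatrix} 6 \\ 10 \end{bmatrix} = -2 \begin{bmatrix} -1 \\ 1 \end{bmatrix} + 12 \begin{bmatrix} 1/3 \\ 1 \end{bmatrix},$$

which is equivalent to $\vec{x}_0 = -2\vec{v}_1 + 12\vec{v}_2$. So,

$$A^n\vec{x}_0 = A^n(-2\vec{v}_1 + 12\vec{v}_2) = -2A^n\vec{v}_1 + 12A^n\vec{v}_2$$


By Remark 1 and Example 3, $A^n\vec{v}_1 = \lambda_1^n\vec{v}_1$ and
$A^n\vec{v}_2 = \lambda_2^n\vec{v}_2$. We then have


$$A^n\vec{x}_0 = -2 \cdot 0.94^n\,\vec{v}_1 + 12 \cdot 0.98^n\,\vec{v}_2$$


By Problem 1, to determine the pool sizes
after 50 years, it is equivalent to find $\vec{x}_{50}$,
where $\vec{x}_{50} = A^{50}\vec{x}_0$. By the analysis we just did,
we know that this can be computed easily:


$$\vec{x}_{50} = A^{50}\vec{x}_0 = -2 \cdot 0.94^{50}\,\vec{v}_1 + 12 \cdot 0.98^{50}\,\vec{v}_2 = \begin{bmatrix} 1.55 \\ 4.28 \end{bmatrix}$$


Therefore, after 50 years, the size of the
fast pool is 1.55, and the size of the slow
pool is 4.28.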


Quiz 3 


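Continue with Problem 1. Let $\vec{x}_0 = \begin{bmatrix} 2 \\ 6 \end{bmatrix}$.
Compute the pool sizes after 20 years. (Answer:
As $\vec{x}_0 = 6\vec{v}_2$, we then have $A^n\vec{x}_0 = 6 \cdot 0.98^n\,\vec{v}_2$. So,
$\vec{x}_{20} = 6 \cdot 0.98^{20}\,\vec{v}_2 = \begin{bmatrix} 1.34 \\ 4.01 \end{bmatrix}$. After 20 years, the size
of the fast pool is 1.34, and the size of the slow
pool is 4.01.)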
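The eigenvector shortcut used in Example 4 and Quiz 3 can be sketched as a small Python function that evaluates $\vec{x}_n = \sum_i c_i \lambda_i^n \vec{v}_i$ directly, without any repeated matrix multiplication. The function name and argument layout are our own; the coefficients, eigenvalues, and eigenvectors are those given in the examples.

```python
def pools_after(n, coeffs, eigenvalues, eigenvectors):
    """Evaluate x_n = sum_i c_i * lambda_i**n * v_i."""
    size = len(eigenvectors[0])
    result = [0.0] * size
    for c, lam, v in zip(coeffs, eigenvalues, eigenvectors):
        for k in range(size):
            result[k] += c * lam ** n * v[k]
    return result


v1 = [-1.0, 1.0]
v2 = [1.0 / 3.0, 1.0]
eigenvalues = [0.94, 0.98]

# Example 4: x_0 = -2 v1 + 12 v2, evaluated after 50 years.
x50 = pools_after(50, [-2.0, 12.0], eigenvalues, [v1, v2])
# Quiz 3: x_0 = 6 v2, evaluated after 20 years.
x20 = pools_after(20, [0.0, 6.0], eigenvalues, [v1, v2])

print([round(value, 2) for value in x50])
print([round(value, 2) for value in x20])
```

Changing the first argument to 10 or 100 gives the remaining pool sizes asked for in Problem 1, at the cost of two exponentiations each.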
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This appendix is intended to equip readers with 
little or no programming experience with basic 
skills to write and read programs in the Python 
language. Python is a powerful programming 
language widely used in many applications. This 
appendix introduces basic programming knowl- 
edge in Python including variables, operators, 
function, class, module, etc. All code examples 
in this appendix come from Python files (test_p2.py,
GeneralModel.py and model.py),
which will be used in practice chapters 8,
12, and 16. Python 3.7 is preferred for the practice 
chapters of this book. Readers are not expected to 
be an expert in programming but to acquire the 
basic ability to read/write Python codes for the 
practice chapters. 


WHAT IS PYTHON AND HOW DOES IT RUN?


Python is a programming language used for 
general-purpose software engineering. It is one 
of the most popular languages among scientists, 
engineers, and mathematicians for the following 
reasons. First, Python has simple syntax similar 
to the English language. It is easy to read, write, 
learn, and maintain Python code. Second, Python 
can work seamlessly on different platforms such 
as Windows, Linux and Mac. Third, Python is an 
interpreted language, which means that it executes 
the code line-by-line. If any error occurs, it will 
stop further execution until programmers are able 
to locate the errors and fix them. Fourth, Python 
comes with many great standard libraries to pro- 
vide users with a vast choice of functions needed 
for their tasks. 

Similar to English, Python is a language com- 
posed of vocabulary and syntax. The vocabulary in 


English comprises different words. The vocabulary 
in Python comprises operators, variables/operands 
and keywords. As in common languages, the right 
syntax must be applied when linking different parts 
of the vocabulary into meaningful statements. In 
English, the sentence ‘learn python I’ is not syntactically
valid. In Python, the expression “hello” + 9 is
likewise not valid: a string and a number cannot be added,
and Python reports an error. Similar to learning English,
readers will learn the most common vocabulary 
and syntax of Python from this appendix. 
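
The difference between a valid and an invalid statement can be tried directly. The lines below are a small illustrative sketch (the variable names are made up and do not come from the book's files): adding a string to a number is rejected with a TypeError, whereas converting the number to a string first gives a valid expression.

```python
# Adding a string and a number is rejected by Python.
try:
    result = "hello" + 9
except TypeError as err:
    result = None
    message = str(err)   # text of the error reported by Python

# Converting the number to a string first makes the expression valid.
greeting = "hello" + str(9)
print(greeting)
```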

Before reading further, please refer to Appendix 
3 to install Python on your computer. 

There are two ways to run a Python program. 
One way is to use an interactive shell window. The 
shell will prompt >>> and wait for the user to type 
in a line of Python code. Given the code input, the 
shell will execute this code, display the results and 
wait for the next code input. For all but the simplest 
operations, we normally wish to gather consecutive 
lines of code in a Python source file (xxx.py).
We may execute a Python source file with a command
like python xxx.py. All code lines in this
source file will be executed one after the other until 
the end of the program is reached. In the practice 
chapters, we will use this second way. 


THE FIRST PYTHON PROGRAM 


A Python program is a collection of code that 
manipulates data. Figure A2.1 is a code segment in 
the test_p2.py program. Each code line except 
for the annotation lines is called an expression. These 
expressions will be executed by the Python interpreter.
There are two ways to denote an annotation.
One is a single-line annotation starting with # as
shown in Figure A2.1. The other is a multiple-line
annotation enclosed in triple quotes (''' or """).
The expression if __name__ == '__main__': in line 6 of
Figure A2.1 indicates the start of a Python program:
the code block below it runs when the file is executed.
If one line contains multiple code expressions, as in
lines 12-13, semicolons (;) are used to
separate the different expressions.

One of the most common expressions is to assign 
a value to a variable like lines 8-26 in the sample 
code of Figure A2.1. Taking line 12 as an example, 
the variable f31 will store the value of 0.72. This 
value can be retrieved by this variable name after 
this code line. A variable must be assigned a value 
before any expressions that use it. Readers will learn 
more about variable types in the next section. 
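
The elements described so far can be combined in a very small program. The following is a sketch only: the file name first_program.py and the variable total are hypothetical, while f31 and f41 reuse the names and values from Figure A2.1.

```python
# A single-line annotation starts with the # symbol.
"""
A multiple-line annotation can be written
between triple quotes.
"""

f31 = 0.72; f41 = 0.28   # two assignments on one line, separated by a semicolon
total = f31 + f41        # f31 and f41 must be assigned before they are used

if __name__ == '__main__':
    # This block runs when the file is executed directly,
    # for example with: python first_program.py
    print(total)
```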

Another common expression prints values on
the screen. Figure A2.2 displays the values for the
last year stored in the res variable (a
two-dimensional array). Programmers often use the
print expression to check values stored in
variables, which is a useful
approach for debugging the Python program. 
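
As a minimal sketch of this debugging habit, the lines below print the last-year values from a small nested list. The numbers, and the use of a plain list of lists instead of the NumPy array used in test_p2.py, are assumptions made for illustration.

```python
# res holds one row per carbon pool and one column per year (made-up numbers).
res = [[1.0, 1.5, 2.0],
       [4.0, 4.2, 4.4]]
nyear = 3

# Print the value of each pool in the last year to check what res contains.
print(res[0][nyear - 1])
print(res[1][nyear - 1])
```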

Python uses colons to introduce an indented
code block, which often appears in while-loops,
for-loops, if-else conditions and function
definitions. Readers will learn more about these
constructs in the following sections. The
code groupings in these constructs are indicated
by white space or indentation (Figure A2.3). The
right level of indentation is important: too much
or too little space will induce an error. 
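
The role of the colon and of indentation can be seen in a short sketch. The variable names and the threshold of 20 are invented for this example; four spaces per level is the common convention.

```python
temperature = 25

if temperature > 20:
    # These two lines form one block: both are indented by four spaces.
    status = "warm"
    print("It is", status)
else:
    # This block runs only when the condition above is False.
    status = "cool"
    print("It is", status)

print("done")   # not indented, so it runs in either case
```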


VARIABLES AND OPERATORS 


As we learned in the previous section, variables
are one of the primary vocabulary elements in the
Python language. Similar to words, variables are
containers to store information such as numbers
and string values. The information type decides
the variable type. Generally, the variable types in
Python are integers (e.g., 1, 2, 3, ...), floating-point
numbers (e.g., 3.1415926), complex numbers (e.g., 12
+ 0.2j), strings (e.g., "helloworld"), Boolean values
(i.e., True or False) and the None type. The final
variable type is special, and it only has one value that is


     [Callout: lines starting with # will not be executed]
6    if __name__ == '__main__':
7
8        output_folder = sys.argv[1]
9
10       B = np.array([0.45, 0.55, 0, 0, 0, 0, 0]).reshape([7,1])
11
12       f31 = 0.72; f41 = 0.28; f42 = 1; f53 = 0.45; f54 = 0.275; f64 = 0.275;
13       f65 = 0.296; f75 = 0.004; f56 = 0.42; f76 = 0.03; f57 = 0.45;
15       A = np.array([-1, 0, 0, 0, 0, 0, 0,
16                     0, -1, 0, 0, 0, 0, 0,
17                     f31, 0, -1, 0, 0, 0, 0,
18                     f41, f42, 0, 1, 0, 0, 0,
19                     0, 0, f53, f54, -1, f56, f57,
20                     0, 0, 0, f64, f65, -1, 0,
21                     0, 0, 0, 0, f75, f76, -1]).reshape([7,7])
         #turnover rate per day of pools: foliage, wood, metabolic litter, structural
         #litter, soil microbial, slow soil, passive soil
         temp = [0.00176, 0.000100104, 0.021468, 0.000845, 0.008534, 8.976e-005, 0.00000154782]


Figure A2.1. Code example of defining an annotation with #. This code segment is from the test_p2.py program.


print(res[:,nyear-1])    # print result of the last year


Figure A2.2. Code example of print expression.


for i in range(1, 4):                            # the first for-loop code block
    for j in range(1, 4):                        # the second for-loop code block
        if [condition illegible in source]:      # if-else code block
            break
        ax = plt.subplot(3, 3, (i-1) * 3 + j)
        ax.plot(x, res[(i-1) * 3 + j - 1,:])
        plt.xlabel("year", fontsize = 12)
        plt.ylabel(pool_names[(i-1) * 3 + (j-1)] + " pool ($g C m^{-2}$)", fontsize = 12)


Figure A2.3. A Python program uses organized spaces to indicate code grouping. This code example includes two for-loop
code blocks (red and green arrows and texts) and an if-else code block (blue arrow and text).

None. Other programming languages such as Java
require you to declare a variable type before using 
it. Python, however, does not require variable dec- 
larations. Variables with the same name can be 
used in different places in a Python program to 
store different types of values. 
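
Because no declaration is needed, the same name can hold values of different types in turn. The sketch below uses a made-up variable called value and the built-in type() function to show the type currently stored.

```python
value = 3                  # an integer
print(type(value))         # <class 'int'>

value = 3.1415926          # the same name now stores a floating-point number
print(type(value))         # <class 'float'>

value = 12 + 0.2j          # a complex number (Python writes the imaginary unit as j)
value = "helloworld"       # a string
value = True               # a Boolean value
value = None               # the special value None
print(type(value))         # <class 'NoneType'>
```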

Operators are another primary vocabulary ele- 
ment in Python. Table A2.1 shows a collection of 
binary operators performing calculations on two 
variables. They may work on numbers or Boolean 
values. The final two operators are special: the
numerical calculation and the assignment are performed
at the same time. The expression a /= b, for
example, will first calculate the value of a divided by b
and then assign this as a new value for the variable a. 
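
The operators in Table A2.1 can be tried with two small numbers. This is an illustrative sketch; the values 7 and 2 are arbitrary.

```python
a = 7
b = 2

print(a + b)    # 9
print(a / b)    # 3.5
print(a // b)   # 3, the integral part of the quotient
print(a % b)    # 1, the remainder
print(a ** b)   # 49
print(a == b)   # False
print(a != b)   # True

a += b          # same as a = a + b, so a is now 9
a /= b          # same as a = a / b, so a is now 4.5
print(a)
```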

Another collection of vocabulary concerns control flow.
Control flow decides which expressions are
executed, and in what order. We will use examples in English
to introduce two common control flows in Python.
The first one is conditional control flow. The example
in English is: ‘If tomorrow is sunny, I will go hiking,
otherwise I will stay at home’. In the conditional control flow,
only one set of statements will be executed, either
‘go hiking’ or ‘stay at home’. The syntax of conditionals
in Python is illustrated in Figure A2.4a. It starts with
a <BooleanExpr> expression whose value is either
True or False. If the value of this expression is
True, then <ExpressionT1>, ..., <ExpressionTk>
will be executed; if it is False, then another set of
expressions <ExpressionF1>, ..., <ExpressionFk>
will be executed. A code example of if-else control
flow is shown in Figure A2.4b. If
self.input_fluxes is an array, the variable
self.tmp_input_fluxes will be assigned a specific element of this
array. Otherwise, self.tmp_input_fluxes stores the
whole values in self.input_fluxes. 
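
The same pattern can be run without NumPy. The sketch below is a simplified stand-in for Figure A2.4b: it assumes a plain Python list instead of an array, drops the self prefix, and uses made-up flux values.

```python
input_fluxes = [1.2, 1.4, 1.1]   # made-up values; could also be a single number
idx = 1

if isinstance(input_fluxes, list):
    # A list was given: pick the element at position idx.
    tmp_input_fluxes = input_fluxes[idx]
else:
    # A single value was given: use it as it is.
    tmp_input_fluxes = input_fluxes

print(tmp_input_fluxes)
```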

The second type of control flow is a loop, 
including a for-loop (Figure A2.5) or a while-loop 
(Figure A2.6). A loop repeats a set of statements over 
and over until a termination status is reached. An 
example in English is ‘repeat taking medicine until you feel 
better’. The syntax of for-loop control flow is shown 


TABLE A2.1
A collection of binary operators in Python


Operator    Description

a+b         sum
a-b         difference
a*b         product
a/b         division
a//b        The integral part of the quotient when a is divided by b
a%b         The remainder when a is divided by b
a**b        a to the power of b
a|b         True when either a or b is True
a&b         True when both a and b are True
not a       True if a is False
a==b        True when a equals b
a!=b        True when a doesn't equal b
a+=b        a=a+b
a/=b        a=a/b


in Figure A2.5a. It starts by retrieving the first
value from a <listExpr> into a variable <var>, then
executes <Expression1>, ..., <ExpressionN>.
This set of statements is executed repeatedly
until every value in <listExpr> has been retrieved. At
the end, <var> will store the last element in
<listExpr>. The code example of a for-loop (Figure
A2.5b) uses a useful procedure, range(10), as the
<listExpr>. Generally, range(n) returns integers from
0 up to n-1. The for-loop iterates over this numerical
list from 0 to 9, assigns each element in the list to
a variable t and prints the value of t to the screen.
Generally, any for-loop control flow can be
rewritten as a while-loop. The code example
in Figure A2.6b generates the same results as
that in Figure A2.5b. The special characteristic of


(a)

if <BooleanExpr>:
    <ExpressionT1>
    ...
    <ExpressionTk>
else:
    <ExpressionF1>
    ...
    <ExpressionFk>

(b)

if type(self.input_fluxes) == np.ndarray:
    self.tmp_input_fluxes = self.input_fluxes[idx]
else:
    self.tmp_input_fluxes = self.input_fluxes


Figure A2.4. The statement form of conditional control flow (a), and its code example in GeneralModel.py (b).
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(a) 


for <var> in <listExpr>:
    <Expression1>
    ...
    <ExpressionN>


(b)


for t in range(10):
    print(t)


Figure A2.5. The statement form of for-loop control flow (a), and its code example (b).


(a) 

while <BooleanExpr>:
    <Expression1>
    ...
    <ExpressionN>


(b)


t=0
while t<10:
    print(t)
    t=t+1


Figure A2.6. The statement form of while-loop control flow (a), and its code example (b).


a while-loop is that programmers do not need
to know in advance how many times the set of
expressions needs to be repeated. The syntax of a
while-loop control flow is shown in Figure A2.6a.
It starts by evaluating <BooleanExpr>, whose
value is either True or False. If the value of this
expression is True, the set of expressions will be
executed and the <BooleanExpr> will be evaluated
again. If the value of <BooleanExpr>
becomes False, the execution of this loop will be
terminated. Generally, programmers initialize a
variable before the while-loop, such as variable t in
the code example (Figure A2.6b). The value of this
variable is changed each time by the set of
expressions in the loop block and the new
value is tested in the <BooleanExpr> to decide
whether the loop is to be terminated or not. 
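
The two loop types can be compared side by side. In the sketch below, the first two loops add up the integers 0 to 9 and give the same total; the third loop illustrates a case where the number of repetitions is not known in advance. The pool size of 100 and the threshold of 1 are made-up numbers.

```python
# For-loop: add up the integers from 0 to 9.
total_for = 0
for t in range(10):
    total_for = total_for + t

# Equivalent while-loop: t is initialized before the loop and changed inside it.
total_while = 0
t = 0
while t < 10:
    total_while = total_while + t
    t = t + 1

# While-loop with an unknown number of repetitions:
# halve a pool until it drops to 1 or below and count the steps.
pool = 100.0
steps = 0
while pool > 1.0:
    pool = pool / 2
    steps = steps + 1

print(total_for, total_while, steps)
```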

The break keyword offers another mechanism
to exit a loop which is currently being executed. It
works with both for-loops and while-loops. To make
the program more readable, it is suggested to avoid
the break keyword where possible. As mentioned above, colons
are used to introduce an indented code block in
control flow. Expressions after the colon have to be
indented by one level (Figure A2.3). For example, in
an if-else control flow, expressions after the if and else
keywords are at the same level of indentation and
this indentation level is below the if and else. 
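
A short sketch shows the break keyword and one way to avoid it. Both versions look for the first integer whose square exceeds 50; the task itself is invented for illustration.

```python
# Version with break: stop at the first integer whose square exceeds 50.
first = None
for n in range(1, 100):
    if n * n > 50:
        first = n
        break

# Version without break: the loop condition itself ends the loop.
m = 1
while m * m <= 50:
    m = m + 1

print(first, m)
```

In the second version the reason for stopping is visible in the while line itself, which is one argument for avoiding break in simple loops.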


ADVANCED VARIABLES AND OPERATORS 


The variables and operators we have seen so far are
known as primitive variables and operators. They
are capable of coping with simple programming
tasks. However, if we only use the primitive variables
and operators, programs can quickly become long
and messy. For longer programs, a modular design
is preferred to make code clear and readable, and
easier to modify or debug. Modularity entails
aggregating primitive variables and operators into more
advanced variables or operators. Advanced variables
are centered around the organization of data. These
advanced variables include list, array, dictionary, and set.
Because the practice chapters of this book mainly
use the list type, this appendix will only introduce
list. For other advanced variables, please refer to the
suggested reading materials at the end of this
appendix. Advanced operators are collections of primitive
operations such as basic arithmetic, conditional
evaluation and recursion. This appendix will introduce
three advanced operators: function, class, and module.


The List Variable 


A list is an ordered and mutable collection of
values such as numbers, strings, or Boolean values.
A list is written using square brackets [ ], with
elements separated by commas. Figure A2.7 shows an
example of defining a list and some common
operations on the list. Each element in the list is
numbered starting from 0. Therefore, the first element
is at index 0, the second element is at index 1, and
the final index is one less than the size (number
of elements) of the list. We can retrieve the final
element of a list called varList using the expression
varList[len(varList) - 1]. We can also select a continuous
part of the list through slicing. For example,
varList[0:3] returns the first three elements (i.e.,
varList[0], varList[1], varList[2]) of varList. Remember
that the index after the colon in the slicing operator
(i.e., 3) will not be included in the result. The
slicing operator always returns a new list. Therefore,
varList[0] returns a variable value while varList[0:1]
returns a list which only includes one variable
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L=[2020,"training",[3,4]] #define a list with 3 elements. The third element is another list 


L[0]=2021 # modify the first element to 2021
L[0:2] # return the first and second elements, [2020, "training"]
L[0:1] # return a new list, [2020]
L[0] # return a number, 2020
len(L) # return the length of the list, 3
L.extend([1,2]) # add more elements to L, now L is [2020, "training", [3,4],1,2]
L.remove(2020) # remove the element 2020, now L is ["training",[3,4]]


Figure A2.7. Common operators of list variable type. Each comment gives the result when the operation is applied to the list as defined in the first line.


varList[0]. Some other common operations are
illustrated in Figure A2.7, such as getting the length of
a list, inserting and removing elements. 
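
The list operations can also be run one after another on the same list. The sketch below follows the list of Figure A2.7; the names first, last, part and single, and the inserted string "new", are made up for this example. Note that the slices taken at the start are new lists, so they are not affected by the later changes to L.

```python
L = [2020, "training", [3, 4]]

first = L[0]             # the first element, 2020
last = L[len(L) - 1]     # the final element, the inner list [3, 4]
part = L[0:2]            # a new list with the first and second elements
single = L[0:1]          # a new list with one element, [2020]

L[0] = 2021              # modify the first element
L.extend([1, 2])         # L is now [2021, "training", [3, 4], 1, 2]
L.insert(1, "new")       # L is now [2021, "new", "training", [3, 4], 1, 2]
L.remove(2021)           # L is now ["new", "training", [3, 4], 1, 2]

print(L)
print(part)              # still [2020, "training"]
```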


The Function Operator 


As an advanced operator, a function can be viewed
as a collection of primitive variables and
operators. Specifically, one function f1 can call another
function f2. As a result, all basic operators in f2 are
included in f1. One powerful characteristic of a
function is that it can be used as a black box:
we only care about the input arguments and the
outputs returned. Another feature of a function is
reusability of code. For example, if we want to decide
whether a numerical value is even or odd, we need
to write four lines of code for each value. The
number of code lines is four times the number of
values (Figure A2.8a). Gathering these four lines
of code in a function helps keep the code clean
and efficient. In Figure A2.8b, we put all
expressions used in determining if a number is even or
odd into a function called Evenodd. We can invoke


(a) 

if 103%2==0:
    print('even')
else:
    print('odd')
if 1100%2==0:
    print('even')
else:
    print('odd')
if 530%2==0:
    print('even')
else:
    print('odd')
if 79%2==0:
    print('even')
else:
    print('odd')


the function by typing its name, followed by an 
argument in parentheses. Each time the function 
is called, it determines whether the argument is 
even or odd and prints out the result on the screen. 

Figure A2.9 shows the syntax of a function 
definition. We write a sequence of expressions 
inside the function and give that function a name. 
The sequence of expressions can be executed at 
any point in the Python program by calling the 
function name. In the example of Figure A2.9, 
the function name is get_df. In function defini- 
tion, variables in parentheses () are the input and 
become available for use by expressions inside the 
function body. These input variables are also called 
parameters or arguments. When calling a func- 
tion, we need to pass a variable storing specific 
values to the function. In the example of Figure 
A2.8, the Evenodd function requires a numerical 
value. If we want a value to be returned after a 
function is called, we need to add a return expres- 
sion at the end of the function definition like line 
81 in Figure A2.9. Remember, the indentation in 
a function definition is important. In the function, 


(b) 
def Evenodd(x):
    if x%2==0:
        print('even')
    else:
        print('odd')

Evenodd(103)
Evenodd(1100)
Evenodd(530)
Evenodd(79)


Figure A2.8. Code example to decide whether each of the numbers 103, 1100, 530, and 79, is even or odd, and display the
answer on the screen. (a) Version using simple code operators; (b) Version using a function to keep the code clean and efficient.


77 def get_df(self):                         # starts with the def keyword; get_df is the name of the function; (self) is the input
78     res = self.Y.y.T
79     df = pd.DataFrame(res)
80     df.columns = self.create_headers()
81     return <outputs>                      # return outputs


Figure A2.9. The syntax of function definition from GeneralModel.py.


def func1(a,b,c):
    ...

def func2(a,b,c):
    ...
    func3()

def func3():
    ...


Figure A2.10. A workflow of function calls.


all expressions after the def keyword and the colon
have to be indented one level below. 
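
A function that returns its result, instead of printing it, lets the caller decide what to do with the value. The following sketch is a variant of the Evenodd example; the name even_or_odd and the results list are made up here.

```python
def even_or_odd(x):
    """Return the string 'even' or 'odd' for the integer x."""
    if x % 2 == 0:
        return 'even'
    else:
        return 'odd'

# Call the function once per value and collect the returned strings.
results = []
for number in [103, 1100, 530, 79]:
    results.append(even_or_odd(number))

print(results)
```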

A function can call another function. With
more and more functions defined, which one is
the highest-level function to call others? As we
learned in the first section, a code line with the
expression if __name__ == '__main__':
indicates the start of a Python program no matter
where it occurs in a source code file. The code
block with the start of program execution is called
the main program. In Figure A2.10, three functions
(func1, func2, func3) are defined. The main program
calls func1 and func2, and func2 further calls
func3. The variables in func1, func2, and the main
program have the same names: a, b, c. How to
distinguish them? Python uses namespaces to do so.
A namespace is an isolated scope where variables are
valid. func1 and func2 have local namespaces and
the main program has a global namespace.
The a, b, and c variables can have different values in
different namespaces. If we want to change a variable
created in the global namespace from inside a function,
it is required to declare the global property of the variable by
using the global keyword before the name of
the variable. 
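
The separation between local and global namespaces can be checked with a short sketch. The function increase_counter and the variables counter and local_result are hypothetical names chosen for this example.

```python
counter = 0      # created in the global namespace
a = 1            # also global

def func1(a):
    # This a is local to func1; changing it does not touch the global a.
    a = a + 10
    return a

def increase_counter():
    # The global keyword makes the assignment act on the global variable.
    global counter
    counter = counter + 1

local_result = func1(a)
increase_counter()
increase_counter()

print(a, local_result, counter)
```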


Figure A2.10 (main program part):

if __name__ == '__main__':    # indicates the start of the program
    ...
    c=3
    func1(a,b,c)              # call each function
    func2(a,b,c)


The Class Operator


A class is a mixture of variables and functions,
which are called attributes. Just as a function,
once defined, can be used many times, a class may
be defined once, and then multiple instances can be
made. Each class instance is called a class object
and maintains its own attributes, which provides a
way to reuse code to keep the program clean and
efficient. This programming style is called
object-oriented programming.

Figure A2.11a shows the syntax of a class
definition. BaseClassName is the parent class from
which the new class inherits. A more detailed introduction to
inheritance is available in the recommended reading
materials. In practice, most expressions inside
a class definition are function definitions. Figure
A2.11b lists all functions defined in the GeneralModel
class from GeneralModel.py. A special function
is used for instantiation and its name is __init__
(spelt with double underscores on both sides of
init). The instantiation function is executed
whenever a class object is created within a Python program. The
instantiation function in the GeneralModel class (Figure
A2.11b) requires six parameters, and the self
parameter represents the object itself. Figure A2.11c
shows an example of initializing a GeneralModel object
(mod). Typically, the expression is the class name
followed by a list in parentheses of the parameters
required by the instantiation function. The syntax
to access attributes of the class object (i.e.,
variables and functions) is obj.name where obj is the class
object and name is the variable name or function
name. Figure A2.11c shows some examples of
calling functions defined in GeneralModel.
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class ClassName(BaseClassName) : 


    <Expression1>
    ...
    <ExpressionN>

(b)

class GeneralModel(Model):
    def __init__(self, times, B, A, K, iv_list, input_fluxes, xi=1):
        ...
    def ode_solver(self):
        ...
    def get_input(self, t, y):
        ...
    def right_hand_equation(self, t, y):
        ...
    def get_x(self):
        ...
    def get_df(self):
        ...
    def write_output(self, filename):
        ...
    def get_diagnostic_variables(self):
        ...
    def sasu_spinup(self):
        ...
    def create_headers(self):
        ...
    def get_x_df(self):
        ...
    def get_pool_n(self):
        ...
    def get_c_input(self):
        ...

(c)

mod = GeneralModel(times, B, A, K, iv_list, input_fluxes)
res = mod.get_x()


Figure A2.11. (a) Syntax of class definition; (b) functions defined in GeneralModel class; (c) code expressions to initialize a
GeneralModel object and to call a function attribute of the object. 
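
A much smaller class than GeneralModel may help to see how instantiation and attribute access work. The Pool class below is hypothetical and is not part of the book's code; it only assumes a pool that gains a fixed input and loses a fraction of its size in each step.

```python
class Pool:
    """A hypothetical carbon pool used only to illustrate class syntax."""

    def __init__(self, name, size, turnover_rate=0.1):
        # The instantiation function stores the parameters as attributes.
        self.name = name
        self.size = size
        self.turnover_rate = turnover_rate

    def step(self, carbon_input):
        """Advance one time step: add the input and remove the turnover."""
        self.size = self.size + carbon_input - self.turnover_rate * self.size
        return self.size

# Two objects of the same class, each with its own attributes.
foliage = Pool("foliage", 100.0)
wood = Pool("wood", 500.0, turnover_rate=0.01)

foliage.step(5.0)   # 100 + 5 - 0.1 * 100, about 95
wood.step(5.0)      # 500 + 5 - 0.01 * 500, about 500

print(foliage.name, foliage.size)
print(wood.name, wood.size)
```

The two objects are created from the same definition but keep separate values of size and turnover_rate, which is the kind of code reuse the text describes.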


from GeneralModel import GeneralModel      # the third way to import a module
import numpy as np                         # the second way to import a module
import matplotlib.pyplot as plt            # the second way to import a module
import sys                                 # the first way to import a module


Figure A2.12. Three ways to import a module. Code is from test_p2.py.


The Module Operator 


A module is a file containing reusable variables and
functions. Unlike a class, which enables the
instantiation of multiple objects and modification of
attributes after creation, a module is a static store
of reusable attributes. Similar to namespaces,
attributes in one module file are not visible to other
files. To make them visible, we need to load the
module file first before using the variables or
functions in the file. Figure A2.12 shows three ways to
load a module file using the import keyword.
The first way is to load it directly without assigning a
simple name to this module. We can retrieve the
variables or functions in the module via its default
name. The second way is to load the module and
assign a new name to it. In the example of Figure
A2.12, we import the Numpy module and rename
it as np. Then we can call functions via np.funcName
such as np.zeros(49). The third way is to load
specific variables or functions from a module file.
Unlike the first two ways, the variables or functions
loaded can be used without the module name. Python
still loads the whole module in this case; only the
listed names become directly visible.


(a)

import numpy as np
np.abs(-18)                       # return the absolute value, 18
np.ndarray                        # an array object
K=np.zeros(49).reshape([7, 7])    # create an array with 49 zeros and convert it to 7*7 matrix
np.multiply(K,1/864)              # multiply each element in K by 1/864
np.matmul(A,B)                    # return the matrix product of two arrays A and B
np.linspace(0,10*365,num = 10)    # return 10 evenly spaced numbers over the interval [0,10*365]
np.linalg.inv(matrix_AK)          # return the inverse of matrix_AK
np.sum(A)                         # return the sum of elements in matrix A

(b)

import pandas as pd
df=pd.DataFrame(Xc)               # return a tabular data (df) from Xc
df.columns=['X1','X2','X3']       # assign labeled column axis
df['X1'] = carbon_storage         # arithmetic operations align on both row and column labels.
df.to_csv(filename, index=False)  # write to a csv file without row names

(c)

from scipy.integrate import solve_ivp
solve_ivp(func,t,y0,t_eval=times,vectorized = True)
# Solve a system of ordinary differential equations given an initial value, e.g. dy/dt=func(t,y) and y(t0)=y0
# t is the interval (t0,tf) where the solver starts with t=t0 and ends t=tf
# t_eval is times at which to store the computed solution according to t
# vectorized indicates whether func is implemented in a vectorized fashion or not

(d)

import matplotlib.pyplot as plt
fig = plt.figure(12, figsize=(14, 7))   # create a figure. 12 is its identifier. figsize sets its width and height in inches
ax = plt.subplot(nrow, ncols, index)    # add subplot to a current figure at the specified grid
plt.subplots_adjust()                   # specify the subplot layout
ax.plot(x,y)                            # plot y versus x as lines and/or markers
plt.xlabel('year')                      # add x labels
plt.ylabel('pool')                      # add y labels
plt.savefig(fileName)                   # save the figure to a file


Figure A2.13. Code examples in GeneralModel.py and test_p2.py using (a) Numpy module, (b) Pandas module,
(c) Scipy module, and (d) Matplotlib module.

In the practice chapters of this book, we will 
use Numpy, Pandas, Scipy, and Matplotlib modules. Numpy 
offers comprehensive mathematical functions 
working on arrays. Pandas provides functionality to 
manipulate tabular data as a two-dimensional data 
structure. Scipy is based on Numpy and solves scien- 
tific and mathematical problems. Matplotlib contains 
functionality for plotting data. Figure A2.13 shows 
some common functions from these modules that 
are useful for the practice chapters of this book. 
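
The three ways of importing can be tried with modules from the Python standard library, so nothing extra needs to be installed. This is a sketch only: the choice of sys, math and os.path, the short name m, and the file names are assumptions made for illustration and are not taken from test_p2.py.

```python
import sys                    # first way: use the module under its default name
import math as m              # second way: load the module under a shorter name
from os.path import join      # third way: load one specific function

major_version = sys.version_info.major   # attribute reached via the module name
root = m.sqrt(49)                        # function reached via the new name m
path = join("output", "result.csv")      # function used without a module name

print(major_version, root, path)
```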


SUMMARY 


If you have read this appendix carefully, you should 
now have a basic knowledge of Python programming,
including variables, if-else conditional control,
for-loops, while-loops, lists, function definitions,
class definitions, and module loading. This knowledge
is sufficient to perform the programming tasks
of the practice chapters of this book. If you wish 
to learn more about Python programming, you can 
refer to the learning resources referenced below. 


SUGGESTED READING 


• https://wiki.python.org/moin/BeginnersGuide
• http://pythontutor.com
• https://thepythonguru.com
• https://pymbook.readthedocs.io/en/latest/
• https://docs.python-guide.org
• https://www.w3schools.com/python/


QUIZZES


1. What is an annotation?
2. What is an operator?
3. Can one function call another function?
4. How would you express an if-then sentence in Python?
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Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 


APPENDIX 3


CarboTrain User Guide 


Yuan Gao 
Northern Arizona University, Flagstaff, USA 


This appendix chapter provides a guide for the 
software CarboTrain, which is short for Carbon 
cycle modeling Training course, and is tailored for 
use in the training course New Advances in Land Carbon 
Cycle Modeling. Modeling studies usually require 
extensive techniques in programming. However, 
the main goal of the training course is to acquaint 
the trainees with advances in modeling land car- 
bon dynamics. CarboTrain is designed to help 
trainees to reach their learning goals without get- 
ting bogged down in programming, in which they 
may have very different levels of skill. The software 
implements all the exercises in Units 2-10 of the 
training course, and of this book. 


INTRODUCTION 


CarboTrain has a user interface as shown in Figure 
A3.1, and it can run on computers running the 
Windows or macOS operating system. 

This software was developed with Python
3.7.9 and PyQt5. In order to run the software
properly, some other software has to
be pre-installed. The following sections provide
a step-by-step guide on how to install and use
CarboTrain. Please be aware that CarboTrain is not
compatible with Apple Silicon CPUs.


DOWNLOAD CARBOTRAIN 


We have a Docker image available on Docker Hub:
https://hub.docker.com/r/gaoyuan199325/carbotrain.
If you have Docker installed, you are
ready to go. We recommend using the Docker
image because it has all the essential software
installed, so you don't need to create any virtual
environments or struggle with installing the software
on your computer. If you choose to use the Docker
image, you may skip this part and jump to “Uses
of CarboTrain” further down in this appendix.

If you want to run CarboTrain without Docker,
you may download and install it with the essential
software on your computer. First, please download
the software from:
https://www2.nau.edu/luo-lab/download/CarboTrain.zip.


PREREQUISITE SOFTWARE 


Three software packages must be installed before
running CarboTrain:


1. Python 3.7.9 and relevant packages
2. Fortran compiler
3. R


Since the installation procedure differs between
operating systems, we will show how to install
this software on each operating system.


INSTALLATION ON WINDOWS 
Install Python 3.7.9 


Download Python 3.7.9 from
https://www.python.org/downloads/release/python-379/
(“Windows x86-64 executable installer”) and
install it on your computer. When installing Python,
check “Add Python 3.7 to PATH” as shown in
Figure A3.2. After installing Python, open the CMD
window to see whether it is installed correctly.
To open the CMD window, type in “CMD” in the
search bar of your computer, as shown in Figure
A3.3. Follow the steps in Figure A3.4 to check
whether you have installed Python successfully.


[Figure A3.1: screenshot of the CarboTrain main window, with drop-down lists to select a unit and an exercise, configuration tabs, an output folder setting and a Run Exercise button.]

Figure A3.1. The Graphical User Interface (GUI) of CarboTrain.


[Figure A3.2: screenshot of the Python 3.7.9 (64-bit) setup window.]

Figure A3.2. Steps to install Python on your computer.


Fortran Complier 


Go to the website https: / /sourceforge.net/proj- 
ects/mingw/ and download the software as 
shown in step 1 in Figure A3.5. Then, step 2 is 
to run the downloaded file by double-clicking it. 
Next, you need to click the “Install” button to start 
the installation. Finally, you need click “Continue” 
several times to finish the installation. 
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Install Python 3.7.9 (64-hi 
Select Install Now to install Python with defa [2] 


Customize to enable or disable features. 


C:\Users\yg336\AppData\Local\Programs\Python\Python37 


Includes IDLE, pip and documentation 
Creates shortcuts and file associations 


> Customize installation 
Choose location and features 


| E Install launcher for all users (recommended) 


DI Add Python 3.7 to PATH Gara 


a 


(1% Check it y 


Once the installation is finished, another win- 
dow will be popped out automatically as shown in 
Figure A3.6. 

Check each box in Figure A3.6 and then select 
“Mark for Installation”. Then go to “Installation 
and click “Update Catalogue” as shown in Figure 
A3.7 to confirm the changes. You then need to 
click “Review Changes” in the pop-up window 
and then click “Apply” in the following pop-up 
window to install all the packages needed. 
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Turn indexng back on. 
Best match 


sl Command Prompt 
App 


Apps 
“> Git CMD 


MM 16426 Cross Tools Command 
Prompt for VS 2019 


E Reset the Visual Studio 2019 
Experimental Instance 


Search the web 


P md 
Settings (1) 


Figure A3.3. Steps to open a CMD window. 


EI Command Prompt - python 


Command Prompt 


Open 

Run as administrator <a e 
Open file location 

Pin to Start 


Pin to taskbar 


Figure A3.4. Steps to check whether Python 3.7.9 is installed. 




Figure A3.5. Steps to install MinGW. 




Figure A3.6. MinGW. 




Figure A3.7. Update catalogue. 


When you have finished the installation of all the 
packages, you need to set the environment variables in 
your computer. To do this, search for “edit the system 
environment variables” in the search bar of your com- 
puter, and then click “Edit the system environment 
variables” and the “System Properties” window will 
show up. Go to the “Advanced” menu and then click 
“Environment Variables...” as shown in Figure A3.8. 

Follow the steps in Figure A3.8 to open the 
environment variable setting window shown in Figure 
A3.9, and then follow the steps in that figure to add a new 
path. First select “Path” and then click “Edit...” 
(steps 1 and 2, respectively, in the left panel of Figure 
A3.9). In the right panel of the dialog shown in 
Figure A3.9, click “New” and then add 
“C:\MinGW\bin” to finish setting the path. 
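
As a quick cross-check of the path setting, the following minimal Python sketch tests whether a folder appears in a PATH-style string. The helper name `path_contains` is hypothetical (it is not part of CarboTrain), and the comparison rules (ignoring letter case and trailing slashes) are an assumption about how Windows treats such entries.

```python
import os


def path_contains(path_value, directory, sep=os.pathsep):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    def normalize(entry):
        # Ignore surrounding whitespace and quotes, trailing slashes and letter
        # case, so that C:\MinGW\bin and c:\mingw\bin\ compare as equal.
        return entry.strip().strip('"').rstrip("\\/").lower()

    wanted = normalize(directory)
    return any(normalize(entry) == wanted for entry in path_value.split(sep))


if __name__ == "__main__":
    # Inspect the PATH seen by the current session.
    print(path_contains(os.environ.get("PATH", ""), r"C:\MinGW\bin"))
```

Run it in a newly opened CMD window, because windows that were already open do not pick up changes to the environment variables.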

Finally, open a CMD window by following the 
steps shown in Figure A3.3 and check whether the 
Fortran compiler has been installed correctly by 
following the steps in Figure A3.10. When you 
type “gfortran”, it will report an error. 
This is expected, because no input files were specified as an 
argument to the gfortran command; the message itself 
shows that the Fortran compiler has been successfully installed. 
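
The same check can be scripted. The sketch below is only an illustration: the helper names `find_tool` and `tool_version` are hypothetical, and it assumes the tool accepts a `--version` option, as gfortran does.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def tool_version(name):
    """Return the first line printed by `<name> --version`, or None if missing."""
    path = find_tool(name)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some tools print their version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    print(tool_version("gfortran") or "gfortran was not found on PATH")
```

If the function returns None, the compiler folder is most likely missing from the path set in the previous step.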


Install R 3.6.3 


Download R 3.6.3 from 
https://cran.r-project.org/bin/windows/base/old/3.6.3/ and install 
it on your computer. Once finished, set the environment 
variables for R following the two steps in 
Figure A3.11. 

After setting the environment variables, you 
can now open a CMD window by following the 
steps shown in Figure A3.3 and follow the steps in 
Figure A3.12 to see whether you have installed R 
3.6.3 correctly. 

After the installation, you need to install the 
Python packages and R packages needed in the 
training course and then compile the Fortran code. 
Locate the folder of CarboTrain, and copy the path 
of that folder as shown in Figure A3.13. 

Once you have copied the path, open a CMD 
window by following the steps shown in Figure 
A3.3 and type in the following commands to 
install the Python packages and R packages: 


1. rmdir /s/q C:\Users\YOUR_USER_NAME\Documents\R\win-library\3.6\ 


2. cd path_you_copied  #Press Enter to confirm 


3. pip3 install pyproj-2.6.1.post1-cp37-cp37m-win_amd64.whl  #Press Enter to confirm 


4. pip3 install basemap-1.2.2-cp37-cp37m-win_amd64.whl  #Press Enter to confirm 


5. pip3 install -r requirements.txt  #Press Enter to confirm 


6. Rscript Rinstall_packages_win.R  #Press Enter to confirm 
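
To confirm that the Python packages were installed, a short check along the following lines may help. It is a sketch only: `missing_modules` is a hypothetical helper, and the import names `pyproj` and `mpl_toolkits.basemap` are assumed from the two wheel files named above.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. mpl_toolkits) is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed for the pyproj and basemap wheels installed above.
    print(missing_modules(["pyproj", "mpl_toolkits.basemap"]))
```

An empty list means every listed module can be located; any name that is printed points to a package whose installation should be repeated.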


In order to run the TECO model properly, you also 
need to compile the source code. Go to the TECO 
source code folder under CarboTrain > Source_code > 
TECO_2.3 and copy the full path as we 
just did in the last step. Once you have copied the 
path, open a CMD window by following the steps 
shown in Figure A3.3, change to that folder with 
“cd path_you_copied”, and type in the following 
command to compile the source code. You may 
ignore any warnings that come up. 


1. gfortran -o TECO_2.3.exe TECO_2.3.f90  #Press Enter to confirm 
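
The compile step can also be driven from Python, which makes it easier to tell a failed build from a build that only produced warnings. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the file names are the ones used in the command above.

```python
import subprocess
from pathlib import Path


def build_compile_command(source="TECO_2.3.f90", output="TECO_2.3.exe"):
    """Return the gfortran command from the text as an argument list."""
    return ["gfortran", "-o", output, source]


def compile_teco(source_dir):
    """Run the compile step inside `source_dir`; return True on success."""
    source_dir = Path(source_dir)
    if not (source_dir / "TECO_2.3.f90").is_file():
        raise FileNotFoundError(f"TECO_2.3.f90 not found in {source_dir}")
    # subprocess.run raises FileNotFoundError if gfortran is not on PATH.
    result = subprocess.run(build_compile_command(), cwd=str(source_dir))
    # By default, warnings leave the exit status at zero; errors do not.
    return result.returncode == 0
```

Calling `compile_teco` with the path you copied returns True when the compiler exits without error, which is consistent with the advice that warnings can be ignored.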


Windows users are now ready to run the practice 
sessions for each unit, and may skip to the sec- 
tion Uses of CarboTrain, below. We will now provide 
instructions on how to install CarboTrain on a 
macOS computer. 




Figure A3.8. Steps to set the environment variables under 
“System Properties”. 




INSTALLATION ON MACOS 


Install Python 3.7.9 


Download Python 3.7.9 from 
https://www.python.org/downloads/release/python-379/ 
(select “macOS 64-bit installer”) and install it on 
your Mac. Once you have finished, open a 
Terminal to test the installation following the two 
steps in Figure A3.14. 

Once you have opened a Terminal, follow the 
steps in Figure A3.15 to check whether you have 
installed Python correctly. 
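
Besides reading the version banner by eye, the version can be checked programmatically. The helper `is_expected_python` below is a hypothetical name used for illustration; it compares only the major and minor numbers.

```python
import sys


def is_expected_python(version_info, expected=(3, 7)):
    """Return True if the major.minor version matches `expected`."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    print(sys.version)
    print("Python 3.7 detected:", is_expected_python(sys.version_info))
```

Save the sketch to a file and run it with `python3`; if the second line reports False, a different Python installation is taking precedence on your system.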


Fortran Compiler 


Download “gcc-x.x-bin.tar.gz” and 
“gfortran-x.x-bin.tar.gz” from http://hpc.sourceforge.net, 
where x.x stands for the version number. Then 
open a Terminal and type in the following commands: 


1. cd && cd Downloads  #Press Enter to confirm 


2. gunzip gcc-x.x-bin.tar.gz  #Press Enter to confirm 


3. gunzip gfortran-x.x-bin.tar.gz  #Press Enter to confirm 


4. sudo tar -xvf gcc-x.x-bin.tar -C /  #Press Enter to confirm 


5. sudo tar -xvf gfortran-x.x-bin.tar -C /  #Press Enter to confirm 


You may be asked to type in your password to continue. 
Once finished, open a Terminal and check the 
installation as shown in Figure A3.16. 


Figure A3.9. Steps to add a new path for MinGW. 


Figure A3.10. Test Fortran compiler installation. 




Figure A3.11. Set environment variables for R. 


Install R 4.0.5 


Download R 4.0.5 from 
https://cran.r-project.org/bin/macosx/ (file name is R-4.0.5.pkg) and 
install it on your computer. After installing R, 
run the following commands to create the soft 
links: 


1. ln -s /Library/Frameworks/R.framework/Resources/R /usr/bin/local/R  #Press Enter to confirm 


2. ln -s /Library/Frameworks/R.framework/Resources/Rscript /usr/bin/local/Rscript  #Press Enter to confirm 


You also need to open a Terminal to check whether 
the installation is successful (Figure A3.17). 

Once you have installed the three software systems 
required, you need to install some Python 
and R packages and compile the source code of 
the TECO model. To do this, copy the CarboTrain 
folder to your Desktop, open a Terminal and 
type in the following commands to install the 
packages: 


1. cd && cd Desktop/CarboTrain/  #Press Enter to confirm 


2. /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"  #Press Enter to confirm, then enter the password and press Enter to confirm the installation 


3. brew install geos  #Press Enter to confirm 


4. brew install proj  #Press Enter to confirm 


5. pip3.7 install basemap-1.2.2rel.tar  #Press Enter to confirm 


6. pip3.7 install -r requirements.txt  #Press Enter to confirm 


7. Rscript Rinstall_packages.R  #Press Enter to confirm 


Next, you can compile the source code of TECO 
as follows: 


1. cd && cd Desktop/CarboTrain/Source_code/TECO_2.3  #Press Enter to confirm 


2. gfortran -o TECO_2.3.exe TECO_2.3.f90  #Press Enter to confirm 


If you see the dialog shown in Figure 
A3.18, click “Install” to install the command line 
developer tools. After installation, run the commands: 


1. cd && cd Desktop/CarboTrain/Source_code/TECO_2.3  #Press Enter to confirm 


2. gfortran -o TECO_2.3.exe TECO_2.3.f90  #Press Enter to confirm 


Figure A3.13. Locating the path of CarboTrain. 


You now have all the required software installed 
and are ready for the practices in the training 
course and in the chapters of this book. 


USES OF CARBOTRAIN 


CarboTrain will be used for the practices in Units 
2 to 10. Units 2 to 4 make use of the matrix ver- 
sion of the TECO ecosystem model and Unit 5 is 
about traceability. In Unit 6, you will practice data 
assimilation (DA) with a simple version of TECO. 
In Units 7 to 9, you will use an intermediately 
complex version of the TECO model to perform 
data assimilation. Unit 10 is the practice for deep 




learning with a process model, which is called 
PROcess-based deep learning and DAta-driven 
modeling (PRODA). The instructions for each 
practice are described in detail in the correspond- 
ing unit. Here we describe the general steps of 
how to use the software. 

To use CarboTrain, you first need to launch it. 
In Windows, copy the path of CarboTrain, change 
to that folder, and run the software in a 
CMD window as below: 


1. cd path_you_copied #Press Enter to confirm 


2. python main.py #Press Enter to confirm 


Figure A3.14. The location of Terminal in macOS. 


Figure A3.15. Steps to check Python version. 


Figure A3.16. Check Fortran compiler installation. 


Figure A3.17. Steps to check R version. 


Figure A3.18. Installation of the command line developer tools. 


In macOS, use the following commands to run the 
software: 


1. cd && cd Desktop/CarboTrain/  #Press Enter to confirm 


2. export QT_MAC_WANTS_LAYER=1  #Press Enter to confirm 


3. python3.7 main.py  #Press Enter to confirm 
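
The launch commands for both systems can be wrapped in one small script. The sketch below is an assumption about how such a wrapper might look, not part of CarboTrain: `launch_environment` and `launch` are hypothetical names, and the script simply mirrors the commands listed above.

```python
import os
import subprocess
import sys


def launch_environment(base_env, platform):
    """Return a copy of `base_env` prepared for launching the GUI."""
    env = dict(base_env)
    if platform == "darwin":
        # Same setting as `export QT_MAC_WANTS_LAYER=1` in the macOS steps.
        env["QT_MAC_WANTS_LAYER"] = "1"
    return env


def launch(carbotrain_dir):
    """Start main.py from `carbotrain_dir` with the current interpreter."""
    env = launch_environment(os.environ, sys.platform)
    return subprocess.run([sys.executable, "main.py"],
                          cwd=str(carbotrain_dir), env=env)
```

Because `launch` uses `sys.executable`, the script should be started with the same Python 3.7 interpreter that was used to install the packages.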


Once you have launched the software, you will 
see a GUI similar to Figure A3.1. The GUI of the 
software consists of two parts: exercise selection 
and exercise configuration, as shown in Figure 
A3.19. 


Figure A3.19. The CarboTrain GUI showing the two parts. 


Each tab shown in the “Exercise configuration” 
part contains one or a set of exercises, as shown in 
Table A3.1. 


TABLE A3.1 
Exercises in different tabs 

Config tab      Exercise(s) 
2               Exercise 2 of unit 2 
3 and 4         Exercises of units 3 and 4 
5               Exercises of unit 5 
6               Exercises of unit 6 
7               Exercises of unit 7 
8 and 9.1       Exercises of unit 8 and exercise 1 of unit 9 
9.2 and 9.3     Exercises 2 and 3 of unit 9 
10.1            Exercise 1 of unit 10 
10.2            Exercise 2 of unit 10 
10.3            Exercise 3 of unit 10 


The general steps for each practice are: (1) 
select a unit, (2) select an exercise, (3) configure 
the exercise you just selected, and (4) run the 
exercise. For example, in order to run Exercise 1 in 
Unit 10, you need to go through the steps shown 
in Figure A3.20. The first two steps are for selecting 
an exercise and step 3 is for configuring the 
exercise. The configuration required may differ 
depending on the details of the exercise. For 
this example, configuration involves selecting the 
output folder. In some exercises, you need to modify 
the source code, change the settings or customize 
the exercise according to your own questions. 
Once you have finished exercise selection and 
configuration, you are ready for step 4: clicking “Run 
Exercise” to start the exercise. 


Figure A3.20. Steps to run Exercise 1 in Unit 10. 

When you click the “Run Exercise” button 
in an exercise, a pop-up window will show 
up with the message “Task submitted!” and you 
need to click “OK” before the exercise starts running. 


Figure A3.21. These pop-up windows appear when starting 
the running of an exercise, and when it is finished. 


Once an exercise is done, a pop-up window with 
the message “Finished!” will inform you that the 
exercise has been completed with no error (Figure 
A3.21). If any errors occur, please double-check 
the steps you went through to run the exercise. If 
you still cannot figure out the cause of the error, 
please ask the instructors. 


soil layers, 53-54 
unified diagnostic system for uncertainty analysis, 74-75 
vertical distribution of SOC across U.S., 334-336, 335 
compartmental dynamic system, 16, 57-64 
autonomous systems (properties and long-term 
behavior), 60-62 
autonomous vs. nonautonomous systems, 59-60 
classification, 59-64, 59 
control theory concepts, 64 
definition, 58-59 
linear systems, 60-63 
linear vs. nonlinear systems, 60 
mass balance of single compartment, 58 
matrix representation, 57 
nonautonomous systems, 62-64 
nonlinear systems, 61, 62, 63-64 
quizzes, 64 
stability analysis near equilibria, 61 
suggested reading, 64; see also time characteristics of 
compartmental systems 
conversion of mass law, 31 
Copernicus LAI time series, 232, 234 
coupled carbon-nitrogen matrix models, 45-56 
application of matrix representation, 47 
biological N fixation, 48 
case study, 47 
Community Land Model version 5 (CLM5), 47-54, 48, 
49, 55-56 
Duke Forest Free-Air CO2 Enrichment (FACE), 47 
global validation of CLM5, 55-56 
litterfall, 48 
matrix representation, 45-47 
metrological datasets, 56 
nitrogen transfers, 45 
quizzes, 56 
suggested reading, 56 
Terrestrial Ecosystem (TECO) model, 45-47, 46 
Coupled Model Intercomparison Project Phase 5 (CMIP5), 
13 
Harmonized World Soil Database (HWSD), 319 
PRODA, 319 
transient traceability framework, 147, 153 
Coupled Model Intercomparison Project Phase 6 (CMIP6), 
139 
traceability analysis, 146, 163, 167-169, 168, 169 
CWD (coarse woody debris), 54, 117 


D 


DALEC (Data Assimilation Linked Ecosystem Carbon), 
229-230, 229 
data assimilation, 173-180 
Bayesian statistics, 181-183, 183 
CarboTrain, 197-205 
choose a model, 216-218 
convergence of MCMC results, 186-187 
cost function, 177, 185, 186, 192-193, 200-201, 201, 
213, 218 
defining objective, 197, 216 
deviance information criterion (DIC), 180 
Ecological Platform for Assimilating Data (EcoPAD), 174, 
295-296 
ecosystem carbon sequestration, 175, 175 
elements for realistic model, 174 


estimate parameters, 203, 220-221 
exercises, 204-206, 205, 206 
flux data, 178 
Free-Air CO2 Enrichment Model Data Synthesis (FACE-MDS), 
174-175 
Gelman-Rubin (G-R) diagnostic method, 178, 186 
Harmonized World Soil Database (HWSD), 175 
Markov Chain Monte Carlo (MCMC), 177, 181, 
183-187, 184, 218-220, 220 
Metropolis-Hastings algorithm, 177, 183-185, 
201-203 
Michaelis-Menten model, 179-180 
model choice, 198-200 
need for, 174-176 
optimization method, 201-203, 218-220 
overview, 173-174 
“perfect candidate”, 183-184 
practice, 197-205 
predict height from arm length, 176-177 
prediction, 203-204, 204, 221 
preparing data, 198, 216 
priming, 179 
prior knowledge, 183 
process-based land carbon models, 174 
quizzes, 180, 187 
scientific values, 178-180, 179 
seven step procedure, 176-178, 176, 197-205, 198, 
198, 216-222 
suggested reading, 180, 187, 206 
Terrestrial Ecosystem (TECO) model, 177, 178, 197, 
198-200, 199 
training course example, 182, 183-184 
Data Assimilation Linked Ecosystem Carbon (DALEC), 
229-230, 229, 234; see also CARDAMOM 
approach (data assimilation) 
data assimilation (soil incubation data), 189-196 
3-pool model, 192, 193, 195 
application of, 191-196, 192, 193, 193-196 
cost function, 192-193 
Markov Chain Monte Carlo (MCMC), 193 
maximum likelihood estimates (MLEs), 194 
quizzes, 196 
soil carbon models, 190-191, 191 
soil incubation experiments, 189-190 
suggested reading, 196 
data points, 79 
data set types (parameters and predictions), 245-253 
correlation between posterior parameter values, 252 
entropy, 247, 248 
information contents of model and data, 246, 248-251, 
250 
litterfall data, 249, 250 
model equifinality, 251-252, 251 
model error (inherent), 246 
model predictions, 252 
model uncertainty, 245 
parameter uncertainty, 250 
prediction of land carbon dynamics after data 
assimilation, 252-253 
quizzes, 253 
Shannon information index, 246-248, 253 
short- and long-term information, 248, 249 
soil organic matter (SOM), 246, 247 
suggested reading, 253 
Terrestrial Ecosystem (TECO) model, 246 
deviance information criterion (DIC), 180 
DGVMs (dynamic global vegetation models), 18 
diagnostic variables in matrix models 
biosphere-atmosphere feedback, 95 
carbon storage capacity and potential, 97-98 
Community Land Model version 5 (CLM5), 96-97 
exercises, 96-98 
mathematical foundation, 95-97 
practice, 95-99 
suggested reading, 99 
Terrestrial Ecosystem (TECO) model, 95, 96-97 
uncertainty, 95 
DIC (deviance information criterion), 180 
donor pool, 4 
donor-pool-dominant transfer, 4-6 
DroughtNet, 293 
Duke Forest Free-Air CO2 Enrichment (FACE), 21, 47, 76, 77 
data assimilation, 174-175 
transient traceability framework, 148-152 
dynamic global vegetation models (DGVMs), 18 


E 


Earth system models (ESMs), 16 
land surface models (LSMs), 17-18 
machine learning and neural networks, 315-317, 316 
PRODA, 319-320 
ecological forecasting, 287-292 
data availability to constrain forecast, 290 
disturbance events, 288, 290 
Ecological Platform for Assimilating Data (EcoPAD), 287, 
290-292, 291 
eddy-flux networks, 290, 292 
Long Term Ecological Research (LTER), 290, 292 
models and predictability of terrestrial carbon cycle, 
288-290, 289 
quizzes, 292 
real-time sensors, 291 
SPRUCE, 290-292 
suggested reading, 292 
weather forecasting, 287-288 
workflow system to facilitate, 290-292 
Ecological Platform for Assimilating Data (EcoPAD), 174, 
287, 290-292, 291, 293-300 
accessing and working with, 302-303 
Bayesian statistics, 296 
CarboTrain, 303-306, 304, 305 
Celery (job queue), 296, 297 
custom workflow web portal, 303 
data assimilation, 295-296 
datasets, 294-295, 297, 301-302 
docker (virtualization platform), 297-298 
DroughtNet, 293 
Eco-PAD-SPRUCE portal, 303 
exercises, 303-305, 304, 305 
file transfer protocol, 302 
flux observations, 302 
FLUXNET, 293, 302 
forecasting uncertainty exercise, 305-306 
general structure, 294-298 
GitHub, 302 
gross primary production (GPP), 298-300, 299 
Markov Chain Monte Carlo (MCMC), 296, 298 
model-experiment (ModEx) system, 294, 295 
National Ecological Observatory Network (NEON), 293 


photosynthetically active radiation (PAR), 302 
posterior parameters exercise, 303-305, 305 
practice, 301-306 
probability density functions, 296 
quizzes, 300 
Representational State Transfer (RESTful) API, 297, 298 
scientific workflow, 296-297, 296 
SPRUCE, 298-300, 299, 301-306 
structured result storage, 298 
suggested reading, 300 
Terrestrial Ecosystem (TECO) model, 295, 302 
user-model interactions, 300 
web request-response, 294 
why do we need, 293-294 
Energy Exascale Earth System Model (E3SM), SPRUCE, 
211-213 
ESMs (Earth system models), 16 
land surface models (LSMs), 17-18 
machine learning and neural networks, 315-317, 316 
PRODA, 319-320 
Euler method, 26 


F 


FACE (Duke Forest Free-Air CO2 Enrichment), 21, 47, 76, 77 
data assimilation, 174-175 
transient traceability framework, 148-152 
Fick's Law, 216-218 
Fine Root Ecology Database (FRED), 212 
fire events, 9, 11 
flow diagrams and balance equations, 23-30 
bank account example, 23, 25 
carbon balance equation, 25-27 
carbon balance equations, 28 
carbon flow diagrams, 23-25, 24, 25 
CENTURY, 27, 27, 31-32, 32 
Community Land Model version 4.5 (CLM4.5), 29, 30 
litter pools, 27 
litterfall, 24, 26 
one-pool carbon model, 25 
ORCHIDEE-MICT, 28, 29, 30 
practice, 31-34, 32, 33 
quizzes, 30 
ReSOM model, 33-34, 33 
respiration, 24, 25 
stock and flow, 23, 31 
suggested reading, 30 
Terrestrial Ecosystem (TECO) model, 27, 27 
two-pool carbon model, 26 
flux data 
data assimilation, 178 
Peatland methane study (data assimilation), 215 
FLUXNET, 293, 302 
forest fires, 3-4, 6, 11 
FRED (Fine Root Ecology Database), 212 
Free-Air CO2 Enrichment Model Data Synthesis (FACE-MDS), 
20, 47, 76, 77 
data assimilation, 174-175 
FUN module, 53 
Fundamentals of Ecology (Odum and Odum), 16 


G 


Gelman-Rubin (G-R) diagnostic method, 91, 186 
data assimilation, 178 
PRODA, 322, 323 
general systems theory, 16 
global soil carbon models (data-constrained uncertainty 
analysis), 263-271 
alternative model structures, 264-267 
Community Land Model version 4 (CLM4), 264 
Community Land Model version 4.5 (CLM4.5), 266, 
270, 270 
datasets and data-model fusion, 267-268 
Harmonized World Soil Database (HWSD), 264, 267 
Markov Chain Monte Carlo (MCMC), 263, 267 
microbial growth efficiency (MGE), 267 
Microbial-Mineral Carbon Stabilization (MIMICS), 
266-268, 270, 270 
Northern Circumpolar Soil Carbon Database (NCSCD), 
264, 267 
permafrost soils, 267 
posterior distribution of model parameters, 268-269, 
268 
quizzes, 271 
representative concentration pathway 8.5 (RCP8.5), 
269-270, 269 
sensitivity to initial conditions and model parameters, 
270-271 
soil C decomposition models, 264, 264 
suggested reading, 271 
GPP (gross primary production), 75-76 
Ecological Platform for Assimilating Data (EcoPAD), 
298-300, 299 
Great Lakes, PRODA, 323 
Great Plains, PRODA, 324 
gross primary production (GPP), 75-76 
Ecological Platform for Assimilating Data (EcoPAD), 
298-300, 299 


H 


Harmonized World Soil Database (HWSD), 175, 263-264, 
267 
CMIP(5), 319 
PRODA, 323 
Harvard Forest, 77 
transient traceability framework, 148-152 
Henry's Law, 218 
Houtouwan, 3, 4, 11 
HWSD (Harmonized World Soil Database), 175, 264, 267 
CMIP(5), 319 
PRODA, 323 


I 


ILAMB (International Land Model Benchmarking), 
158-160 
benchmarking, 160, 162 
image classification, 311, 314-315, 315 
industrial revolution, 115-116 
information contents of model and data, 246, 248-251, 
250, 271 
assimilation of carbon flux measurements, 279-280, 
280 
assimilation of carbon pool and flux measurements, 
281-282, 282 
assimilation of carbon pool measurements, 277-279, 
278 
carbon flux predictions, 283 




CarboTrain, 273-283 
model uncertainty, 273 
posterior distribution of model parameters, 276 
practice, 273-283 
real data assimilation, 283 
without data assimilation, 274-277, 275; see also data set 
types (parameters and predictions) 
input-to-state stability (ISS), 64 
International Land Model Benchmarking (ILAMB), 158-160 
benchmarking, 160, 162 
IPCC climate projections, 18, 19, 23, 139 
ISS (input-to-state stability), 64 


J 


Jacobian matrix, 62 


L 


LabelMe, 312 

LAI (Leaf Area Index), 218, 228 

lake carbon dynamics (controlling mechanisms), 255-262 
Akaike information criterion (AIC), 260 
epilimnetic C dynamics, 256-262, 257, 261 
hypothesis (gravity-driven density currents), 258 
hypothesis (hypolimnetic water), 258-259 
hypothesis (photooxidation of DOC), 258 
hypothesis (varying DOC lability), 257, 260 
Long lake, 256 
Markov Chain Monte Carlo (MCMC), 255 
Metropolis-Hastings algorithm, 259, 260 
model-based hypothesis testing, 256 
model calibration and selection, 259-260 
models and data-model fusion, 255-256, 261 
overfitting, 256, 259 
quizzes, 262 
suggested reading, 262 
thermocline, 256 

land surface models (LSMs), Earth system models (ESMs), 17-18 

landscapes, vegetation take over, 3, 6, 11 

Leaf Area Index (LAI), 218, 228 

leaf pools 
carbon balance, 6 
carbon from photosynthesis, 6, 7 

litter decay constant, 5 

litter decomposition, 5 

litter pools, 4 
flow diagrams and balance equations, 27 
matrix equation, 39 
number, 38 

litterfall 
coupled carbon-nitrogen matrix models, 48 
data, 249, 250 
flow diagrams and balance equations, 24, 26 
rate, 4 
recipient pool, 4, 5 

Long Term Ecological Research (LTER), 290, 292 

LUNA module, 53 


M 


machine learning and neural networks, 309-317 
activation function, 330 
baseline accuracy, 311 




cell image classification, 311 

CellProfiler Analyst system, 311 

correlation between outputs, 316-317 
cross-validation of parameters of ESMs, 315-317, 316 
email spam filtering, 311 

epoch number, 331 

estimate of variance, 311 

generalization goal, 310 

hyper-parameters, 312, 313, 331-332 

image classification, 311, 314-315, 315 

K-fold cross-validation, 310-311, 310, 313, 315-317 
loss function, 330 

MNIST/Fashion-MNIST datasets, 309, 310, 315 
optimizers, 331 

under/overfitting, 312-314, 313, 333 
overview, 309-311, 309 

practice, 329-336 

PRODA vs. data assimilation alone, 334-336 
quizzes, 317 

retinal photographs, 312 

suggested reading, 317 

train dataset, 310, 313, 315 

training target, 330 

tuning for better performance, 333, 333 
validation error (U shape), 314 


Markov Chain Monte Carlo (MCMC), 124 


and Bayesian statistics, 181-187 

CARDAMOM approach (data assimilation), 230, 231 
Community Land Model version 5 (CLM5), 329 
data assimilation, 177, 183-187, 218-220, 220 
data assimilation (soil incubation data), 193 


Ecological Platform for Assimilating Data (EcoPAD), 296, 


298 
global soil carbon models (data-constrained uncertainty 
analysis), 263, 267 
lake carbon dynamics (controlling mechanisms), 255 
Metropolis-Hastings algorithm, 183-185, 184 
PRODA, 320, 322, 323 
SPRUCE, 213 


matrix algebra, 337-342 


eigenvalues and eigenvectors, 341-342 
linear system, 341 

matrix equations, 340-341 

matrix multiplication, 338-339 

matrix operations, 338 

motivations, 337-338 

quizzes, 339, 341, 342 

Strang's Linear Algebra lecture (MIT), 337 
suggested reading, 342 


matrix approach (model representation), 6-10 


pool-and-flux, 6, 38 


matrix models (developing), 37-43 
CENTURY, 65-66 

coding and running, 66-69, 67-68, 69 
Community Land Model version 4.5 (CLM4.5), 39 
deriving the matrix equation, 39-42, 40 

litter pools, 39 

matrix version of carbon balance equation, 37-39 
ORCHIDEE-MICT, 39-42, 39, 41-42 

phosphorus, 89-91, 90 

practice, 65-69 

Python, 66-69 

quizzes, 43 

suggested reading, 42 


Terrestrial Ecosystem (TECO) model, 38, 39 
matrix phosphorus model and data assimilation, 87-94 
balance equations, 90-91 
CENTURY, 88 
construction of model, 89-90, 90 
data assimilation example, 89-91, 90 
data selection and description, 89 
matrix approach, 88-89, 93 
model validation and data assimilation, 91 
phosphorus, 87-88 
quizzes, 94 
soil dynamics models, 88 
soil P and other ecosystem properties, 93, 94 
soil P dynamics quantified, 91-93, 92-93, 92 
spin-up, 88, 89 
suggested reading, 94 
supercomputer use, 80, 93 
MCMC (Markov Chain Monte Carlo), 124 
and Bayesian statistics, 181-187 
CARDAMOM approach (data assimilation), 230, 231 
Community Land Model version 5 (CLM5), 329 
data assimilation, 177, 183-187, 218-220, 220 
data assimilation (soil incubation data), 193 
Ecological Platform for Assimilating Data (EcoPAD), 296, 
298 
global soil carbon models (data-constrained uncertainty 
analysis), 263, 267 
lake carbon dynamics (controlling mechanisms), 255 
Metropolis-Hastings algorithm, 183-185, 184 
PRODA, 320, 322, 323 
SPRUCE, 213 
methane, 215; see also Peatland methane study (data 
assimilation) 
metrological datasets, 56 
Metropolis-Hastings algorithm, 91 
data assimilation, 177, 183-185, 184, 201-203 
lake carbon dynamics (controlling mechanisms), 259, 
260 
Markov Chain Monte Carlo (MCMC), 183-185, 184 
MGE (microbial growth efficiency), 267 
Michaelis-Menten model, 179-180, 266 
microbial growth efficiency (MGE), 267 
Microbial-Mineral Carbon Stabilization (MIMICS), 266-268, 
270, 270 
MIPs (model intercomparison projects), 73, 139 
CarboTrain, 163 
transient traceability framework, 147, 148, 152-156 
model equifinality, data set types (parameters and 
predictions), 251-252, 251 
model error, inherent, 20, 246 
model intercomparison projects (MIPs), 73, 139 
CarboTrain, 163 
transient traceability framework, 147, 148, 152-156 
model predictions, 252 
model simulations, 79 
modeling (introduction), 13-21 
common everyday models, 14 
modeling workflow, 18-21 
models in research, 14-15, 15 
modes of application, 15 
predicting forward in time, 15 
quizzes, 21 
suggested reading, 21 
system dynamics, 16 
types of land carbon cycle models, 16-18, 18 
ways of using models, 15-16 
what is a model, 13 
modeling workflow, 18-21 
calibrate the model, 20 
choose a model, 19 
data assimilation, 197-205 
design the model experiment, 21 
initialization/spin-up, 21 
seven step procedure, 176-178, 176, 197-205, 198, 
198, 216-222 
software application, 19 
specify the question/hypothesis, 18 
validate the model, 20-21 
verify the model works, 19-20 
Monte Carlo sampling, 80 
Multiscale Synthesis and Terrestrial Model Intercomparison 
Project (MsTMIP), 116 


N 


National Ecological Observatory Network (NEON), 293 
NBP (net biome production), 20 
NCSCD (Northern Circumpolar Soil Carbon Database), 264, 
267 
NEE (net ecosystem exchange), 232 
NEON (National Ecological Observatory Network), 293 
net biome production (NBP), 20 
net ecosystem exchange (NEE), 232 
net primary productivity (NPP), 73, 75-76, 77, 96, 119 
PRODA, 322 
traceability analysis, 142, 142, 165 
transient traceability framework, 148, 151, 151, 152, 
153, 153, 154, 155, 156 
NIMBioS workshop (2012), 10, 12 
nitrogen cycle, 17 
nitrogen transfers 
biological N fixation, 48 
coupled carbon-nitrogen matrix models, 45 
nonautonomous ODE system solver and stability analysis, 
103-113 
3-pool model, 107-108, 109, 110 
analytical solution, 104-112 
case study, 111-112 
first-order non-homogeneous scalar equation, 104 
global attractor, 109-111 
homogeneous nonautonomous ODEs system, 105 
n-pool model, 106-107 
non-homogeneous nonautonomous ODEs system, 
105-106 
one-pool carbon model, 104-105 
quizzes, 112-113 
stability, 108-111 
suggested reading, 112 
nonautonomous systems, 10, 77 
nonlinear microbial models, 10 
Northern Circumpolar Soil Carbon Database (NCSCD), 264, 267 
NPP (net primary productivity), 73, 75-76, 77, 96, 119 
PRODA, 322 
traceability analysis, 142, 142, 165 
transient traceability framework, 148, 151, 151, 152, 
153, 153, 154-156 
numerical simulation models, 14 
numerical weather prediction (NWP), 14-15, 288 


O 


observational error, 227 
ODEs (ordinary differential equations), 103; see also 
nonautonomous ODE system solver and stability 
analysis 
one-pool carbon model, 25 
nonautonomous ODE system solver and stability 
analysis, 104-105 
ORCHIDEE, 28 
semi-analytical spin-up (SASU) method, 129 
ORCHIDEE-CNP, 25 
ORCHIDEE-MICT, 28, 29, 30 
matrix models (developing), 39-42, 39, 41-42 
pool number, 38 
sensitivity analysis with matrix equations, 80, 81-82 
vertical soil layers, 42 
ordinary differential equations (ODEs), 103; see also 
nonautonomous ODE system solver and stability 
analysis 
overfitting, 227 
lake carbon dynamics (controlling mechanisms), 256, 
259 
machine learning and neural networks, 312-314, 313, 
333 


P 


PAR (photosynthetically active radiation), 302 
Peatland methane study (data assimilation), 215-223 
flux data, 215 
Markov Chain Monte Carlo (MCMC), 218-220, 220 
quizzes, 223 
seven step procedure, 216-222 
soil temperature, 221 
suggested reading, 223 
Terrestrial Ecosystem (TECO) model, 216-222, 217, 
219, 220-222 
uncertainty in methane modeling, 215-216 
PFT (plant functional types), 232 
phosphorus 
genetic information carriers, 87 
matrix model and data assimilation, 87-88 
photosynthetic products, 4 
photosynthetically active radiation (PAR), 302 
plant functional types (PFT), 232 
practice 
Community Land Model version 5 (CLM5), 329-336, 
330 
data assimilation, 197-205 
diagnostic variables in matrix models, 95-99 
Ecological Platform for Assimilating Data (EcoPAD), 
301-306 
flow diagrams and balance equations, 31-34, 32, 33 
information contents of model and data, 273-283 
machine learning and neural networks, 329-336 
matrix models (developing), 65-69 
semi-analytical spin-up (SASU) method, 129-135 
SPRUCE, 237-241, 237 
traceability analysis, 163-170 
process-based land carbon models, 80 
PROcess-guided deep learning and DAta-driven modeling 
(PRODA), 319-328 
big data, 320 
Community Land Model version 3.5 (CLM3.5), 320 

Community Land Model version 5 (CLM5), 320-321, 
321, 323, 324, 325, 327-329, 334-336 

Coupled Model Intercomparison Project Phase 5 
(CMIP5), 319 

deep learning model, 322-323 

Earth system models (ESMs), 319-320 

Gelman-Rubin (G-R) diagnostic method, 322, 323 

Great Lakes, 323 

Great Plains, 324 

Harmonized World Soil Database (HWSD), 323 

Markov Chain Monte Carlo (MCMC), 320, 322, 323 

model representation of SOC across U.S. sites, 323 

net primary productivity (NPP), 322 

process-based model, 320-322 

quizzes, 328 

reference SOC data products, 323 

SOC distribution (realistic representations), 327-328, 
327 

soil carbon and site-level data assimilation, 322 

soil layer numbers, 321 

soil layer thickness, 321 

spatial distribution of SOC across U.S., 323-324, 324, 
324 

suggested reading, 328 

vertical distribution of SOC across U.S., 325-327, 325, 
326, 334-336, 335 

workflow of PRODA, 320-323 


Python 
advanced variables and operators, 346-350 
Boolean values, 344 
CarboTrain, 66-69, 67-68, 69, 197-205 
class operator, 348-349, 349 
code block, 344, 344, 348 
data assimilation, 197-205 
first program, 343-344 
function operator, 347-348, 347 
introduction to programming, 343-351 
keras package, 331 
list variable, 346-347, 347 
module operator, 349-350, 350 
namespaces, 349 
quizzes, 351 
shell window, 343 
suggested reading, 350-351 
syntax, 343, 347, 348 
variables and operators, 344-346 
workflow of function calls, 348, 348 


Q 


quizzes 
benchmarking, 162 
compartmental dynamic system, 64 
coupled carbon-nitrogen matrix models, 56 
data assimilation, 180 
data assimilation (soil incubation data), 196 
data set types (parameters and predictions), 253 
ecological forecasting, 292 
Ecological Platform for Assimilating Data (EcoPAD), 300 
flow diagrams and balance equations, 30 
global soil carbon models (data-constrained uncertainty 
analysis), 271 
lake carbon dynamics (controlling mechanisms), 262 
machine learning and neural networks, 317 
matrix algebra, 339, 341, 342 
matrix models (developing), 43 
matrix phosphorus model and data assimilation, 94 
modeling (introduction), 21 
nonautonomous ODE system solver and stability 
analysis, 112-113 
Python, 351 
semi-analytical spin-up (SASU) method, 121 
sensitivity analysis with matrix equations, 85 
SPRUCE, 214 
theoretical foundations, 12 
time characteristics of compartmental systems, 127 
traceability analysis, 146 
transient traceability framework, 156 
unified diagnostic system for uncertainty analysis, 78 


S 


SASU see semi-analytical spin-up (SASU) method 
scientific hypotheses, 14 
secondary succession, 4, 6 
semi-analytical spin-up (SASU) method, 115-121 
accelerated decomposition (AD), 116-117 
accelerated spin-up (ASU), 117 
CABLE, 115, 117, 118, 119-120, 119, 120, 129 
Community Land Model version 5 (CLM5), 129 
computational efficiency, 120-121, 129 
exercises, 130-135 
industrial revolution, 115-116 
mathematical foundation, 117-119 
native dynamics spin-up (ND), 116, 118, 129, 131 
nonlinear systems, 132-135 
ORCHIDEE, 129 
passive soil turnover rate/soil carbon, 134 
practice, 129-135 
quizzes, 121 
soil dynamics models, 89 
spin-up, 115-117 
steady state, 116 
suggested reading, 121 
supercomputer use, 116, 120-121 
Terrestrial Ecosystem (TECO) model, 129-135, 130, 
133 
sensitivity analysis with matrix equations, 79-85 
one-at-a-time sensitivity analysis, 82-85, 83, 84 
ORCHIDEE-MICT, 80, 81-82 
quizzes, 85 
sensitivity analysis, 79-80 
Sobol sensitivity analysis, 80-82, 83 
spatial pattern, 85 
suggested reading, 85 
supercomputer use, 80 
variance-based approach, 80 
Shannon information index, 246-248, 253 
SiB2 (Simple Biosphere Model), 17 
Silver Springs (Florida), 16-17, 17 
Simple Biosphere Model (SiB2), 17 
Sobol sensitivity analysis, 80-82, 83 
SOC (soil organic carbon), 5 
distribution across U.S. PRODA model, 323-324, 324, 
324, 325-327, 325, 326, 334-336, 335 
three-pool models, 6 


software application, modeling workflow, 19 
soil carbon cycling, 4 
soil dynamics models 
matrix phosphorus model and data assimilation, 88 
semi-analytical spin-up (SASU) method, 89 
spin-up, 88, 89 
soil incubation, 5 
soil incubation experiments, data assimilation (soil 
incubation data), 189-190, 190 


soil layers, coupled carbon-nitrogen matrix models, 53-54 


soil organic carbon (SOC), 5 
distribution across U.S. PRODA model, 323-324, 324, 
324, 325-327, 325, 326, 334-336, 335 
three-pool models, 6 
soil organic matter (SOM), 4 
CARDAMOM approach (data assimilation), 230 
data set types (parameters and predictions), 246, 247 
modeling workflow, 19 
single-pool, 19 
soil phosphorus, terrestrial carbon dynamics, 87 
Soil-Plant-Atmosphere (SPA) model, 229 
soil respiration, 24, 25 